Discussion: Why won't it index scan?

Why won't it index scan?

From:
"Ed L."
Date:
Can someone help me understand why the 8.1.2 query below is
using a seq scan instead of an index scan?  All relevant
columns appear to be indexed and all tables vacuum analyzed.

$ psql -c "analyze verbose patient"
INFO:  analyzing "public.patient"
INFO:  "patient": scanned 3000 of 3353 pages, containing 117685 live rows and 5471 dead rows; 3000 rows in sample,
131533estimated total rows 
ANALYZE

$ psql -c "select count(1) from patient"
                  
 count
--------
 131661
(1 row)


$ psql -c "analyze verbose visit"
INFO:  analyzing "public.visit"
INFO:  "visit": scanned 3000 of 19985 pages, containing 58520 live rows and 7045 dead rows; 3000 rows in sample, 389841
estimatedtotal rows 
ANALYZE

$ psql -c "select count(1) from visit"
 count
--------
 389102
(1 row)


$ psql -c "explain analyze select * from visit inner join patient on patient.key = visit.patient_key where
nursestation_key= '40';" 
                                                                    QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 Merge Join  (cost=27724.37..28457.01 rows=4956 width=421) (actual time=1819.993..2004.802 rows=6 loops=1)
   Merge Cond: ("outer".patient_key = "inner"."key")
   ->  Sort  (cost=11859.31..11871.70 rows=4956 width=209) (actual time=0.416..0.426 rows=6 loops=1)
         Sort Key: visit.patient_key
         ->  Bitmap Heap Scan on visit  (cost=69.35..11555.14 rows=4956 width=209) (actual time=0.187..0.245 rows=6
loops=1)
               Recheck Cond: (nursestation_key = 40)
               ->  Bitmap Index Scan on idx_visit_nursestation_key  (cost=0.00..69.35 rows=4956 width=0) (actual time=0.158..0.158 rows=6 loops=1)
                     Index Cond: (nursestation_key = 40)
   ->  Sort  (cost=15865.05..16194.21 rows=131661 width=212) (actual time=1768.501..1856.334 rows=61954 loops=1)
         Sort Key: patient."key"
         ->  Seq Scan on patient  (cost=0.00..4669.61 rows=131661 width=212) (actual time=0.010..355.299 rows=131661
loops=1)
 Total runtime: 2046.323 ms
(12 rows)


                                         Table "public.patient"
      Column       |            Type             |                       Modifiers
-------------------+-----------------------------+-------------------------------------------------------
 key               | integer                     | not null default nextval('patient_key_seq'::regclass)
...
Indexes:
    "pk_patient" PRIMARY KEY, btree ("key")
...

                                           Table "public.visit"
        Column         |            Type             |                      Modifiers
-----------------------+-----------------------------+-----------------------------------------------------
 patient_key           | integer                     | not null
 nursestation_key      | integer                     |
...
Indexes:
    "idx_visit_nursestation_key" btree (nursestation_key)
    "idx_visit_patient_key" btree (patient_key)

Re: Why won't it index scan?

From:
"John D. Burger"
Date:
Ed L. wrote:

> Can someone help me understand why the 8.1.2 query below is
> using a seq scan instead of an index scan?

Because the planner thinks a sequential scan would be faster than an
index scan - in many situations, this is the case.  See the FAQ:

   http://www.postgresql.org/docs/faqs.FAQ.html#item4.6
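
As a quick sanity check (a sketch only, using the standard enable_seqscan
setting; this is a diagnostic, not a fix), you can forbid seq scans in one
session and compare the plan and the actual runtime:

    SET enable_seqscan = off;
    EXPLAIN ANALYZE
      SELECT * FROM visit
      INNER JOIN patient ON patient.key = visit.patient_key
      WHERE nursestation_key = '40';
    RESET enable_seqscan;

If the forced index plan turns out much faster, the planner's row estimates
deserve a closer look.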

- John Burger
   MITRE


Re: Why won't it index scan?

From:
"Ed L."
Date:
On Wednesday May 17 2006 10:37 am, Ed L. wrote:
> Can someone help me understand why the 8.1.2 query below is
> using a seq scan instead of an index scan?  All relevant
> columns appear to be indexed and all tables vacuum analyzed.
>
>
> $ psql -c "explain analyze select * from visit inner join
> patient on patient.key = visit.patient_key where
> nursestation_key = '40';" QUERY PLAN
> --------------------------------------------------------------
>---------------------------------------------------------------
>---------------------- Merge Join  (cost=27724.37..28457.01
> rows=4956 width=421) (actual time=1819.993..2004.802 rows=6
> loops=1) Merge Cond: ("outer".patient_key = "inner"."key")
>    ->  Sort  (cost=11859.31..11871.70 rows=4956 width=209)
> (actual time=0.416..0.426 rows=6 loops=1) Sort Key:
> visit.patient_key
>          ->  Bitmap Heap Scan on visit  (cost=69.35..11555.14
> rows=4956 width=209) (actual time=0.187..0.245 rows=6 loops=1)
> Recheck Cond: (nursestation_key = 40)
>                ->  Bitmap Index Scan on
> idx_visit_nursestation_key  (cost=0.00..69.35 rows=4956
> width=0) (actual time=0.158..0.158 rows=6 loops=1) Index Cond:
> (nursestation_key = 40) ->  Sort  (cost=15865.05..16194.21
> rows=131661 width=212) (actual time=1768.501..1856.334
> rows=61954 loops=1) Sort Key: patient."key"
>          ->  Seq Scan on patient  (cost=0.00..4669.61
> rows=131661 width=212) (actual time=0.010..355.299 rows=131661
> loops=1) Total runtime: 2046.323 ms
> (12 rows)


Increasing statistics target yielded index scan.

How can I best find optimal statistics target to ensure 100%
index scan?
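
For reference, the per-column form of that knob (a sketch; 100 is only an
example value, and the column name is the one from this thread) is:

    ALTER TABLE visit ALTER COLUMN nursestation_key SET STATISTICS 100;
    ANALYZE visit;
    EXPLAIN ANALYZE
      SELECT * FROM visit
      INNER JOIN patient ON patient.key = visit.patient_key
      WHERE nursestation_key = '40';

Repeating this with different targets and comparing the estimated vs. actual
rows on the bitmap index scan line shows when the estimate becomes sane.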

Ed

Re: Why won't it index scan?

From:
"Ed L."
Date:
On Wednesday May 17 2006 11:44 am, Ed L. wrote:
> On Wednesday May 17 2006 10:37 am, Ed L. wrote:
> > Can someone help me understand why the 8.1.2 query below is
> > using a seq scan instead of an index scan?  All relevant
> > columns appear to be indexed and all tables vacuum analyzed.
> >
> >
> > $ psql -c "explain analyze select * from visit inner join
> > patient on patient.key = visit.patient_key where
> > nursestation_key = '40';" QUERY PLAN
> > ------------------------------------------------------------
> >--
> > ------------------------------------------------------------
> >--- ---------------------- Merge Join
> > (cost=27724.37..28457.01 rows=4956 width=421) (actual
> > time=1819.993..2004.802 rows=6 loops=1) Merge Cond:
> > ("outer".patient_key = "inner"."key") ->  Sort
> > (cost=11859.31..11871.70 rows=4956 width=209) (actual
> > time=0.416..0.426 rows=6 loops=1) Sort Key:
> > visit.patient_key
> >          ->  Bitmap Heap Scan on visit
> > (cost=69.35..11555.14 rows=4956 width=209) (actual
> > time=0.187..0.245 rows=6 loops=1) Recheck Cond:
> > (nursestation_key = 40)
> >                ->  Bitmap Index Scan on
> > idx_visit_nursestation_key  (cost=0.00..69.35 rows=4956
> > width=0) (actual time=0.158..0.158 rows=6 loops=1) Index
> > Cond: (nursestation_key = 40) ->  Sort
> > (cost=15865.05..16194.21 rows=131661 width=212) (actual
> > time=1768.501..1856.334 rows=61954 loops=1) Sort Key:
> > patient."key"
> >          ->  Seq Scan on patient  (cost=0.00..4669.61
> > rows=131661 width=212) (actual time=0.010..355.299
> > rows=131661 loops=1) Total runtime: 2046.323 ms
> > (12 rows)
>
> Increasing statistics target yielded index scan.
>
> How can I best find optimal statistics target to ensure 100%
> index scan?

I'm trying to understand what happened here, and I have a theory.
There are 389K rows total, and 262K rows with a null indexed
value.  There are 15164 non-null rows newer than those null
rows.  When the stats target is set to 50 or less, analyze scans
15,000 rows or less.  If it scans the newest rows/pages first,
is it possible it never sees any hint of the 262K null rows, and
thus ends up with skewed stats that yield seq scans when an idx
scan is in order?  If the stats target is > 50, analyze begins to
include the null rows in the stat sample, yielding idx scans.

Also, I see the most_common_vals array is not growing linearly
with the stats target as the docs seem to suggest.  I have 34
unique values, so with stats target >= 34, I'd expect
most_common_vals array to have 34 values, but it has 8.
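
The statistics ANALYZE collected can be inspected directly (a sketch against
the standard pg_stats view):

    SELECT null_frac, n_distinct, most_common_vals, most_common_freqs
    FROM pg_stats
    WHERE tablename = 'visit' AND attname = 'nursestation_key';

null_frac shows whether the sample saw the null rows at all, and n_distinct
feeds the estimate for values that are not in most_common_vals.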

Ed

Re: Why won't it index scan?

From:
Tom Lane
Date:
"Ed L." <pgsql@bluepolka.net> writes:
> I'm trying to understand what happened here, and I have a theory.

The problem is the horrid misestimation of the selectivity of
"nursestation_key = 40":

               ->  Bitmap Index Scan on idx_visit_nursestation_key  (cost=0.00..69.35 rows=4956 width=0) (actual time=0.158..0.158 rows=6 loops=1)
                     Index Cond: (nursestation_key = 40)

When you're off by a factor of 800+ on the number of matching rows,
you're going to arrive at a less than optimal plan.  Increasing the
stats target on visit.nursestation_key would be the solution.

> There are 389K rows total, and 262K rows with a null indexed
> value.  There are 15164 non-null rows newer than those null
> rows.  When stats target is set to 50 or less, analyze scans
> 15,000 rows or less.  If it scans the newest rows/pages first,
> then is it possible it never sees any hint of the 262K null
> rows, and thus ends up with skewed stats that yield seq scans
> when idx scan is in order?

ANALYZE goes to some considerable trouble to make sure it gets an
unbiased random sample.  With those numbers it would see an expected
500+ of the nonnull rows; a sample containing none at all would be
highly improbable.

> Also, I see the most_common_vals array is not growing linearly
> with the stats target as the docs seem to suggest.  I have 34
> unique values, so with stats target >= 34, I'd expect
> most_common_vals array to have 34 values, but it has 8.

To get into most_common_vals, a value has to occur more than once in the
sample.  Given the situation you have, it's not surprising that not all
the possible values got into the stats.
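
To see why only a handful of values make the list, the true frequencies can
be checked directly (a sketch; this does a full scan of visit):

    SELECT nursestation_key, count(*) AS n
    FROM visit
    GROUP BY nursestation_key
    ORDER BY n DESC;

Values rare enough to show up at most once in a 3000-row sample cannot be
counted as "common".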

            regards, tom lane

Re: Why won't it index scan?

From:
"Ed L."
Date:
On Wednesday May 17 2006 1:26 pm, Tom Lane wrote:
> "Ed L." <pgsql@bluepolka.net> writes:
> > I'm trying to understand what happened here, and I have a
> > theory.
>
> The problem is the horrid misestimation of the selectivity of
> "nursestation_key = 40":
>
>                ->  Bitmap Index Scan on
> idx_visit_nursestation_key  (cost=0.00..69.35 rows=4956
> width=0) (actual time=0.158..0.158 rows=6 loops=1) Index Cond:
> (nursestation_key = 40)
>
> When you're off by a factor of 800+ on the number of matching
> rows, you're going to arrive at a less than optimal plan.
> Increasing the stats target on visit.nursestation_key would be
> the solution.

Ok, makes sense.

So, does this sound like we just happened to get repeatedly
horribly unrepresentative random samples with stats target at
10?  Are we at the mercy of randomness here?  Or is there a
better preventive procedure we can follow to systematically
identify this kind of situation?

Ed

Re: Why won't it index scan?

From:
Tom Lane
Date:
"Ed L." <pgsql@bluepolka.net> writes:
> So, does this sound like we just happened to get repeatedly
> horribly unrepresentative random samples with stats target at
> 10?  Are we at the mercy of randomness here?  Or is there a
> better preventive procedure we can follow to systematically
> identify this kind of situation?

I think the real issue is that stats target 10 is too small for large
tables: the samples are just not large enough to support a decent
numdistinct estimate, which is the critical stat for cases such as this
(ie, estimating the number of hits on a value that's not in the
most-common-values list).

The reason the default is currently 10 is just conservatism: it was
already an order of magnitude better than what it replaced (a *single*
representative value) and I didn't feel I had the evidence to justify
higher values.  It's become clear that the default ought to be higher,
but I've still got no good fix on a more reasonable default.  100 might
be too much, or then again maybe not.

I encourage you to play around with default_statistics_target and see
what you can learn about quality of estimates vs. planning time.
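
A minimal way to run that experiment (a sketch; 10 and 100 are just example
values) is to change the default per session, re-analyze, and compare both
the estimates and the timings:

    SET default_statistics_target = 100;   -- affects ANALYZE in this session
    ANALYZE visit;
    EXPLAIN ANALYZE
      SELECT * FROM visit WHERE nursestation_key = '40';
    -- then repeat with SET default_statistics_target = 10; and compare

Columns with an explicit per-column SET STATISTICS keep their own target and
ignore the session default.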

            regards, tom lane

Re: Why won't it index scan?

From:
Greg Stark
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> "Ed L." <pgsql@bluepolka.net> writes:
> > So, does this sound like we just happened to get repeatedly
> > horribly unrepresentative random samples with stats target at
> > 10?  Are we at the mercy of randomness here?  Or is there a
> > better preventive procedure we can follow to systematically
> > identify this kind of situation?
>
> I think the real issue is that stats target 10 is too small for large
> tables: the samples are just not large enough to support a decent
> numdistinct estimate, which is the critical stat for cases such as this
> (ie, estimating the number of hits on a value that's not in the
> most-common-values list).

There's been some discussion on -hackers about this area. Sadly the idea of
using samples to calculate numdistinct estimates is fundamentally on pretty
shaky ground.

Whereas a fixed sample size works fine for calculating distribution of values,
in order to generate consistent precision for numdistinct estimates the
samples will have to be a constant fraction of the table -- and unfortunately
a pretty large fraction at that.

So sadly I think "at the mercy of randomness" is pretty accurate. You'll have
to raise the statistics target as the table grows and I expect you'll
eventually run into some downsides of large stats targets.

Some better algorithms were posted, but they would require full table scans
during analyze, not just samples.
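
One way to see how far off the sampled estimate is for a given column (a
sketch; the second query is exact but requires a full scan):

    SELECT n_distinct FROM pg_stats
    WHERE tablename = 'visit' AND attname = 'nursestation_key';

    SELECT count(DISTINCT nursestation_key) FROM visit;

A negative n_distinct means ANALYZE decided the number of distinct values
scales with the table size and stored it as a fraction of the row count.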

--
greg

Re: Why won't it index scan?

From:
Peter Kovacs
Date:
Sorry for the naive question, but: is there a problem with analyze doing
full table scans? Analyze will not lock anything, will it?

Peter

Greg Stark wrote:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>
>
>> "Ed L." <pgsql@bluepolka.net> writes:
>>
>>> So, does this sound like we just happened to get repeatedly
>>> horribly unrepresentative random samples with stats target at
>>> 10?  Are we at the mercy of randomness here?  Or is there a
>>> better preventive procedure we can follow to systematically
>>> identify this kind of situation?
>>>
>> I think the real issue is that stats target 10 is too small for large
>> tables: the samples are just not large enough to support a decent
>> numdistinct estimate, which is the critical stat for cases such as this
>> (ie, estimating the number of hits on a value that's not in the
>> most-common-values list).
>>
>
> There's been some discussion on -hackers about this area. Sadly the idea of
> using samples to calculate numdistinct estimates is fundamentally on pretty
> shaky ground.
>
> Whereas a fixed sample size works fine for calculating distribution of values,
> in order to generate consistent precision for numdistinct estimates the
> samples will have to be a constant fraction of the table -- and unfortunately
> a pretty large fraction at that.
>
> So sadly I think "at the mercy of randomness" is pretty accurate. You'll have
> to raise the statistics target as the table grows and I expect you'll
> eventually run into some downsides of large stats targets.
>
> Some better algorithms were posted, but they would require full table scans
> during analyze, not just samples.
>
>

Re: Why won't it index scan?

From:
Bruce Momjian
Date:
Peter Kovacs wrote:
> Sorry for the naive question, but: is there a problem with analyze doing
> full table scans? Analyze will not lock anything, will it?

It used to do that, but the read overhead was too great.

---------------------------------------------------------------------------

>
> Peter
>
> Greg Stark wrote:
> > Tom Lane <tgl@sss.pgh.pa.us> writes:
> >
> >
> >> "Ed L." <pgsql@bluepolka.net> writes:
> >>
> >>> So, does this sound like we just happened to get repeatedly
> >>> horribly unrepresentative random samples with stats target at
> >>> 10?  Are we at the mercy of randomness here?  Or is there a
> >>> better preventive procedure we can follow to systematically
> >>> identify this kind of situation?
> >>>
> >> I think the real issue is that stats target 10 is too small for large
> >> tables: the samples are just not large enough to support a decent
> >> numdistinct estimate, which is the critical stat for cases such as this
> >> (ie, estimating the number of hits on a value that's not in the
> >> most-common-values list).
> >>
> >
> > There's been some discussion on -hackers about this area. Sadly the idea of
> > using samples to calculate numdistinct estimates is fundamentally on pretty
> > shaky ground.
> >
> > Whereas a fixed sample size works fine for calculating distribution of values,
> > in order to generate consistent precision for numdistinct estimates the
> > samples will have to be a constant fraction of the table -- and unfortunately
> > a pretty large fraction at that.
> >
> > So sadly I think "at the mercy of randomness" is pretty accurate. You'll have
> > to raise the statistics target as the table grows and I expect you'll
> > eventually run into some downsides of large stats targets.
> >
> > Some better algorithms were posted, but they would require full table scans
> > during analyze, not just samples.
> >
> >
>

--
  Bruce Momjian   http://candle.pha.pa.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: Why won't it index scan?

From:
"Jim C. Nasby"
Date:
On Wed, May 17, 2006 at 10:29:14PM -0400, Tom Lane wrote:
> The reason the default is currently 10 is just conservatism: it was
> already an order of magnitude better than what it replaced (a *single*
> representative value) and I didn't feel I had the evidence to justify
> higher values.  It's become clear that the default ought to be higher,
> but I've still got no good fix on a more reasonable default.  100 might
> be too much, or then again maybe not.
>
> I encourage you to play around with default_statistics_target and see
> what you can learn about quality of estimates vs. planning time.

Is the only downside to a large value planning speed? It seems it would
be hard to bloat that too much, except in cases where people are
striving for millisecond response times, and those folks had better know
enough about tuning to be able to adjust the stats target...

I'd wager that upping the target to 100 would eliminate a lot more
support emails than it creates.
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

Re: Why won't it index scan?

From:
Tom Lane
Date:
"Jim C. Nasby" <jnasby@pervasive.com> writes:
> On Wed, May 17, 2006 at 10:29:14PM -0400, Tom Lane wrote:
>> The reason the default is currently 10 is just conservatism: it was
>> already an order of magnitude better than what it replaced (a *single*
>> representative value) and I didn't feel I had the evidence to justify
>> higher values.  It's become clear that the default ought to be higher,
>> but I've still got no good fix on a more reasonable default.  100 might
>> be too much, or then again maybe not.

> Is the only downside to a large value planning speed? It seems it would
> be hard to bloat that too much, except in cases where people are
> striving for millisecond response times, and those folks had better know
> enough about tuning to be able to adjust the stats target...

It would be nice to have some *evidence*, not unsupported handwaving.

            regards, tom lane

Re: Why won't it index scan?

From:
"Joshua D. Drake"
Date:
> The reason the default is currently 10 is just conservatism: it was
> already an order of magnitude better than what it replaced (a *single*
> representative value) and I didn't feel I had the evidence to justify
> higher values.  It's become clear that the default ought to be higher,
> but I've still got no good fix on a more reasonable default.  100 might
> be too much, or then again maybe not.
>

My hands-on experience is that 10 is plenty except for a minority of
tables within a database. Those tables can be accurately represented
using alter table without having to adjust the global default.

That being said, 10 is fairly small and I often find myself setting the
value to at least 250 just to keep it out of my way.

Sincerely,

Joshua D. Drake

--

    === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
    Providing the most comprehensive  PostgreSQL solutions since 1997
              http://www.commandprompt.com/



Re: Why won't it index scan?

From:
Alban Hertroys
Date:
Tom Lane wrote:
> "Jim C. Nasby" <jnasby@pervasive.com> writes:
>
>>Is the only downside to a large value planning speed? It seems it would
>>be hard to bloat that too much, except in cases where people are
>>striving for millisecond response times, and those folks had better know
>>enough about tuning to be able to adjust the stats target...
>
>
> It would be nice to have some *evidence*, not unsupported handwaving.

For that you'd need a large enough sample of the statistics sizes people
use/need. To me it has always been a bit vague under what conditions I'd
need to change statistics sizes for tables; I imagine I'm not the only
one. That makes it all the harder to determine a good default value.

I suppose it'd be useful to have some kind of measurement toolkit that
people can run on their databases to collect statistics about what
statistics sizes would be "optimal". There must be some mathematical way
to determine this on a given data set?

Being able to provide these numbers to you guys would then help in
determining what a good default statistics size is, and maybe even for
determining an algorithm to adjust statistics sizes on the fly...

Regards,
--
Alban Hertroys
alban@magproductions.nl

magproductions b.v.

T: ++31(0)534346874
F: ++31(0)534346876
M:
I: www.magproductions.nl
A: Postbus 416
    7500 AK Enschede

// Integrate Your World //

Re: Why won't it index scan?

From:
Ed Loehr
Date:
On Monday May 22 2006 4:43 pm, Tom Lane wrote:
> "Jim C. Nasby" <jnasby@pervasive.com> writes:
> > On Wed, May 17, 2006 at 10:29:14PM -0400, Tom Lane wrote:
> >> The reason the default is currently 10 is just
> >> conservatism: it was already an order of magnitude better
> >> than what it replaced (a *single* representative value) and
> >> I didn't feel I had the evidence to justify higher values.
> >> It's become clear that the default ought to be higher, but
> >> I've still got no good fix on a more reasonable default.
> >> 100 might be too much, or then again maybe not.
> >
> > Is the only downside to a large value planning speed? It
> > seems it would be hard to bloat that too much, except in
> > cases where people are striving for millisecond response
> > times, and those folks had better know enough about tuning
> > to be able to adjust the stats target...
>
> It would be nice to have some *evidence*, not unsupported
> handwaving.

Not exactly related to the topic at hand, but we set the column-specific
target to 50 before it would use the plan.  Then we set it to values up
to 500, analyzing after each alter, and each of those fell back into
seq scanning.  Of 50, 60, 70, 100, 200, 300, 500, only 50 was foolproof.
I have no idea why.  I would have only expected the plan to get better
with higher targets.

Ed

Re: Why won't it index scan?

From:
"Jim C. Nasby"
Date:
On Mon, May 22, 2006 at 06:43:22PM -0400, Tom Lane wrote:
> "Jim C. Nasby" <jnasby@pervasive.com> writes:
> > On Wed, May 17, 2006 at 10:29:14PM -0400, Tom Lane wrote:
> >> The reason the default is currently 10 is just conservatism: it was
> >> already an order of magnitude better than what it replaced (a *single*
> >> representative value) and I didn't feel I had the evidence to justify
> >> higher values.  It's become clear that the default ought to be higher,
> >> but I've still got no good fix on a more reasonable default.  100 might
> >> be too much, or then again maybe not.
>
> > Is the only downside to a large value planning speed? It seems it would
> > be hard to bloat that too much, except in cases where people are
> > striving for millisecond response times, and those folks had better know
> > enough about tuning to be able to adjust the stats target...
>
> It would be nice to have some *evidence*, not unsupported handwaving.

If someone has an idea on how to actually get that evidence, I'm all
ears. In the meantime, the number of people that run into problems with
the default of 10 provides pretty good evidence that that number is far
too low, so it would be better to take a guess at a better number and
miss whatever would be optimal than continue on with something that's
pretty clearly a problem.

Of course this applies to other GUC's as well; the autovacuum settings
come to mind, now that 8.2 does a better job of finding a more realistic
value for shared_buffers.
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

Re: Why won't it index scan?

From:
"Jim C. Nasby"
Date:
On Mon, May 22, 2006 at 05:55:16PM -0700, Joshua D. Drake wrote:
>
> >The reason the default is currently 10 is just conservatism: it was
> >already an order of magnitude better than what it replaced (a *single*
> >representative value) and I didn't feel I had the evidence to justify
> >higher values.  It's become clear that the default ought to be higher,
> >but I've still got no good fix on a more reasonable default.  100 might
> >be too much, or then again maybe not.
> >
>
> My hands on experience is that 10 is plenty except for the minority of
> tables within a database. Those table can be accurately represented
> using alter table without having to adjust the global.

Ditto.

> That being said, 10 is fairly small and I often find myself setting the
> value to at least 250 just to keep it out of my way.

And ditto (though I normally go for 100).

Have you ever run into problems from setting this too high?
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

Re: Why won't it index scan?

From:
"Joshua D. Drake"
Date:
>> That being said, 10 is fairly small and I often find myself setting the
>> value to at least 250 just to keep it out of my way.
>
> And ditto (though I normally go for 100).
>
> Have you ever run into problems from setting this too high?

No. Except that it does increase the time it takes to analyze, which can
be a performance factor if your IO is limited.

Joshua D. Drake



--

    === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
    Providing the most comprehensive  PostgreSQL solutions since 1997
              http://www.commandprompt.com/



Re: Why won't it index scan?

From:
Tom Lane
Date:
"Jim C. Nasby" <jnasby@pervasive.com> writes:
> On Mon, May 22, 2006 at 06:43:22PM -0400, Tom Lane wrote:
>> It would be nice to have some *evidence*, not unsupported handwaving.

> If someone has an idea on how to actually get that evidence, I'm all
> ears.

Well, for example, actually measuring the planner overhead from larger
pg_statistic entries would be interesting.  Plus how much more time
ANALYZE takes to generate the entries.  (I'm afraid that ANALYZE is
probably worse-than-linear CPU-wise, but it may be dominated by disk
access.)
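
One rough way to get at the planner side (a sketch; 8.1 does not report
planning time separately, so psql's \timing on a plain EXPLAIN, which plans
the query without executing it, is about the best available):

    \timing
    EXPLAIN SELECT * FROM visit
      INNER JOIN patient ON patient.key = visit.patient_key
      WHERE nursestation_key = '40';
    -- re-run after ANALYZE at different default_statistics_target settings
    -- and compare the elapsed times; \timing on ANALYZE itself shows the
    -- cost of generating the larger entries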

            regards, tom lane

Re: Why won't it index scan?

From:
"Jim C. Nasby"
Date:
On Tue, May 23, 2006 at 06:18:07PM -0400, Tom Lane wrote:
> "Jim C. Nasby" <jnasby@pervasive.com> writes:
> > On Mon, May 22, 2006 at 06:43:22PM -0400, Tom Lane wrote:
> >> It would be nice to have some *evidence*, not unsupported handwaving.
>
> > If someone has an idea on how to actually get that evidence, I'm all
> > ears.
>
> Well, for example, actually measuring the planner overhead from larger
> pg_statistic entries would be interesting.  Plus how much more time
> ANALYZE takes to generate the entries.  (I'm afraid that ANALYZE is
> probably worse-than-linear CPU-wise, but it may be dominated by disk
> access.)

How should I go about analyzing planner time? Subtract \timing from
EXPLAIN ANALYZE?

Well, I did find one reason not to go ape with this: the number of pages
analyzed scales with the number of buckets, so doubling the statistics
target will roughly double the ANALYZE time for any table over 6000
pages (though the effect isn't linear, see below). There is a small
increase in time for a small table, but I doubt it's enough for anyone
to care:

(single AMD64, 1G memory, FBSD 6.0, software RAID1)
bench=# set default_statistics_target= 100;
SET
Time: 0.320 ms
bench=# analyze verbose accounts;analyze verbose branches;
INFO:  analyzing "public.accounts"
INFO:  "accounts": scanned 30000 of 5172414 pages, containing 1740000 live rows and 0 dead rows; 30000 rows in sample,
300000012estimated total rows 
ANALYZE
Time: 198892.080 ms
INFO:  analyzing "public.branches"
INFO:  "branches": scanned 17 of 17 pages, containing 3000 live rows and 0 dead rows; 3000 rows in sample, 3000
estimatedtotal rows 
ANALYZE
Time: 25.133 ms
bench=# set default_statistics_target= 10;
SET
Time: 0.212 ms
bench=# analyze verbose accounts;analyze verbose branches;
INFO:  analyzing "public.accounts"
INFO:  "accounts": scanned 3000 of 5172414 pages, containing 174000 live rows and 0 dead rows; 3000 rows in sample,
300000012estimated total rows 
ANALYZE
Time: 27227.885 ms
INFO:  analyzing "public.branches"
INFO:  "branches": scanned 17 of 17 pages, containing 3000 live rows and 0 dead rows; 3000 rows in sample, 3000
estimatedtotal rows 
ANALYZE
Time: 1.973 ms
bench=# analyze branches;
ANALYZE
Time: 2.016 ms
bench=# analyze branches;
ANALYZE
Time: 2.009 ms
bench=# set default_statistics_target= 100;
SET
Time: 0.210 ms
bench=# analyze branches;
ANALYZE
Time: 2.231 ms
bench=# analyze branches;
ANALYZE
Time: 2.346 ms
bench=# analyze branches;
ANALYZE
Time: 9.220 ms
bench=# analyze branches;
ANALYZE
Time: 2.057 ms
bench=#
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

Re: Why won't it index scan?

From:
"Ed L."
Date:
On Tuesday May 23 2006 4:55 pm, Jim C. Nasby wrote:
> Well, I did find one reason not to go ape with this: the
> number of pages analyzed scales with the number of buckets, so
> doubling the statistics target will roughly double the ANALYZE
> time for any table over 6000 pages (though the effect isn't
> linear, see below). There is a small increase in time for a
> small table, but I doubt it's enough for anyone to care:

Are you accounting for your well-primed OS and DB caches?  I'd
think a more realistic test would clear those of the target
table between measurements.

Ed

Re: Why won't it index scan?

From:
"Jim C. Nasby"
Date:
On Tue, May 23, 2006 at 05:00:13PM -0600, Ed L. wrote:
> On Tuesday May 23 2006 4:55 pm, Jim C. Nasby wrote:
> > Well, I did find one reason not to go ape with this: the
> > number of pages analyzed scales with the number of buckets, so
> > doubling the statistics target will roughly double the ANALYZE
> > time for any table over 6000 pages (though the effect isn't
> > linear, see below). There is a small increase in time for a
> > small table, but I doubt it's enough for anyone to care:
>
> Are you accounting for your well-primed OS and DB caches?  I'd
> think a more realistic test would clear those of the target
> table between measurements.

That was after a number of other analyze runs that had already happened,
and the table is way larger than my cache, so it's unlikely that caching
played much of a role.
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461