Обсуждение: index vs. seq scan choice?
I am trying to figure out how the distribution of data affects index
usage by the query because I am seeing some behavior that does not seem
optimal to my uneducated eye.
I am on PG 8.1.8. I have two tables foo and foo_detail, both have been
vacuum analyzed recently. Both have a property_id column, both have an
index on it. The foo table has a state_code, also indexed, and the
relative share of rows for the two state_codes used in the example below
is:
  PA     2842    2.80%
  MN     2858    2.81%
The distribution of distinct property_ids is fairly similar:
  PA     719    2.90%
  MN     765    3.09%
A simple query filtered by PA vs. MN produces different results (see
below). The PA query does a Seq Scan, the MN query uses the index and is
>20 times faster. Both return about the same number of rows. I tried it
with all state_codes that have rows in foo and it seems that the cutoff
is somewhere around 3%, but there isn't a direct correlation (there are
state_codes that are < 3% that trigger a Seq Scan and there are ones
above 3% that result in an Index scan).
I am curious what could make the PA query to ignore the index. What are
the specific stats that are being used to make this decision? Would it
perform better if it were to use the index? Anything I can do to "nudge"
it towards using the index, which seems like a rather beneficial thing?
The actual queries:
explain analyze
select
  f.property_id
from foo f
  inner join foo_detail fd using (property_id)
where f.state_code = 'PA'
Merge Join  (cost=17842.71..18436.30 rows=3347 width=4) (actual
time=594.538..972.032 rows=2842 loops=1)
  Merge Cond: ("outer".property_id = "inner".property_id)
  ->  Sort  (cost=4381.72..4390.09 rows=3347 width=4) (actual
time=14.092..18.497 rows=2842 loops=1)
        Sort Key: f.property_id
        ->  Bitmap Heap Scan on foo f  (cost=22.71..4185.78 rows=3347
width=4) (actual time=0.826..7.008 rows=2842 loops=1)
              Recheck Cond: (state_code = 'PA'::bpchar)
              ->  Bitmap Index Scan on mv_search_state
(cost=0.00..22.71 rows=3347 width=0) (actual time=0.734..0.734 rows=2842
loops=1)
                    Index Cond: (state_code = 'PA'::bpchar)
  ->  Sort  (cost=13460.99..13732.84 rows=108742 width=4) (actual
time=580.312..754.012 rows=110731 loops=1)
        Sort Key: fd.property_id
        ->  Seq Scan on foo_detail fd  (cost=0.00..4364.42 rows=108742
width=4) (actual time=0.006..210.846 rows=108742 loops=1)
Total runtime: 991.852 ms
explain analyze
select
  f.property_id
from foo f
  inner join foo_detail fd using (property_id)
where f.state_code = 'MN'
Nested Loop  (cost=7.62..8545.85 rows=1036 width=4) (actual
time=0.877..44.196 rows=2858 loops=1)
  ->  Bitmap Heap Scan on foo f  (cost=7.62..2404.44 rows=1036 width=4)
(actual time=0.852..6.579 rows=2858 loops=1)
        Recheck Cond: (state_code = 'MN'::bpchar)
        ->  Bitmap Index Scan on mv_search_state  (cost=0.00..7.62
rows=1036 width=0) (actual time=0.744..0.744 rows=2858 loops=1)
              Index Cond: (state_code = 'MN'::bpchar)
  ->  Index Scan using ix_fd on foo_detail fd  (cost=0.00..5.92 rows=1
width=4) (actual time=0.005..0.007 rows=1 loops=2858)
        Index Cond: ("outer".property_id = fd.property_id)
Total runtime: 48.439 ms
			
		"George Pavlov" <gpavlov@mynewplace.com> writes:
> I am curious what could make the PA query to ignore the index. What are
> the specific stats that are being used to make this decision?
The frequency of the specific value being searched for, and the overall
order-correlation of the column.  Since the latter is not dependent on a
particular value, my guess at the reason for the inconsistent results is
that you don't have the column's statistics target set high enough to
track all the interesting values --- or maybe just not high enough to
acquire sufficiently accurate frequency estimates for them.  Take a look
at the pg_stats row for the column ...
(The default statistics target is 10, which is widely considered too
low --- you might find 100 more suitable.)
            regards, tom lane
			
		Tom Lane wrote: > (The default statistics target is 10, which is widely considered too > low --- you might find 100 more suitable.) Does this mean that we should look into raising the default a bit? -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Tom Lane wrote:
>> (The default statistics target is 10, which is widely considered too
>> low --- you might find 100 more suitable.)
> Does this mean that we should look into raising the default a bit?
Probably ... the question is to what.
The default of 10 was chosen in our usual spirit of conservatism ---
and IIRC it was replacing code that tracked only *one* most common
value, so it was already a factor of 10 better (and more expensive)
than what was there before.  But subsequent history suggests it's
too small.  I'm not sure I want to vote for another 10x increase by
default, though.
            regards, tom lane
			
		Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
>> Tom Lane wrote:
>>> (The default statistics target is 10, which is widely considered too
>>> low --- you might find 100 more suitable.)
>
>> Does this mean that we should look into raising the default a bit?
>
> Probably ... the question is to what.
>
> The default of 10 was chosen in our usual spirit of conservatism ---
> and IIRC it was replacing code that tracked only *one* most common
> value, so it was already a factor of 10 better (and more expensive)
> than what was there before.  But subsequent history suggests it's
> too small.  I'm not sure I want to vote for another 10x increase by
> default, though.
Outside of longer analyze times, and slightly more space taken up by the
statistics, what is the downside? I mean in reality... what is setting
to 100 going to do to effect actual production usage of even a modest
machine?
Sincerely,
Joshua D. Drake
>
>             regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 1: if posting/reading through Usenet, please send an appropriate
>        subscribe-nomail command to majordomo@postgresql.org so that your
>        message can get through to the mailing list cleanly
>
--
       === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive  PostgreSQL solutions since 1997
              http://www.commandprompt.com/
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/
			
		"Joshua D. Drake" <jd@commandprompt.com> writes:
> Tom Lane wrote:
>> I'm not sure I want to vote for another 10x increase by
>> default, though.
> Outside of longer analyze times, and slightly more space taken up by the
> statistics, what is the downside?
Longer plan times --- several of the selfuncs.c routines grovel over all
the entries in the pg_statistic row.  AFAIK no one's measured the real
impact of that, but it could easily be counterproductive for simple queries.
            regards, tom lane
			
		On May 24, 2007, at 8:26 PM, Tom Lane wrote: > "Joshua D. Drake" <jd@commandprompt.com> writes: >> Tom Lane wrote: >>> I'm not sure I want to vote for another 10x increase by >>> default, though. > >> Outside of longer analyze times, and slightly more space taken up >> by the >> statistics, what is the downside? > > Longer plan times --- several of the selfuncs.c routines grovel > over all > the entries in the pg_statistic row. AFAIK no one's measured the real > impact of that, but it could easily be counterproductive for simple > queries. The lateness of the hour is suppressing my supposed statistics savvy, so this may not make sense, but... Would it be possible to look at a much larger number of samples during analyze, then look at the variation in those to generate a reasonable number of pg_statistic "samples" to represent our estimate of the actual distribution? More datapoints for tables where the planner might benefit from it, fewer where it wouldn't. Cheers, Steve
> Would it be possible to look at a much larger number of samples during
> analyze,
> then look at the variation in those to generate a reasonable number of
> pg_statistic "samples" to represent our estimate of the actual
> distribution?
> More datapoints for tables where the planner might benefit from it, fewer
> where it wouldn't.
    Maybe it would be possible to take note somewhere of the percentage of
occurence of the most common value (in the OP's case, about 3%), in which
case a quick decision can be taken to use the index without even looking
at the value, if we know the most common one is below the index use
threshold...
			
		Steve Atkins wrote: > Would it be possible to look at a much larger number of samples > during analyze, > then look at the variation in those to generate a reasonable number of > pg_statistic "samples" to represent our estimate of the actual > distribution? > More datapoints for tables where the planner might benefit from it, > fewer > where it wouldn't. You could definitely try to measure the variance of the statistics (using, say, bootstrap resampling), and change the target 'til you got a "good" tradeoff between small sample size and adequate representation of the distribution. Unfortunately, I think the definition of "good" depends strongly on the kinds of queries that get run. Basically, you want the statistics target to be just big enough that more stats wouldn't change the plans for common queries. Remember, too, that this is not just one number, it'd be different for each column (perhaps zero for most). I could imagine hillclimbing the stats targets by storing common queries and then replaying them, while varying the sample size. There was a discussion last year related to all of this, see: http://archives.postgresql.org/pgsql-general/2006-10/msg00526.php - John D. Burger MITRE
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us] > "George Pavlov" <gpavlov@mynewplace.com> writes: > > I am curious what could make the PA query to ignore the > index. What are > > the specific stats that are being used to make this decision? > > you don't have the column's statistics target set high enough to > track all the interesting values --- or maybe just not high enough to > acquire sufficiently accurate frequency estimates for them. > Take a look at the pg_stats row for the column ... > > (The default statistics target is 10, which is widely considered too > low --- you might find 100 more suitable.) Well, it seems that it would be more beneficial for me to set it LOWER than the default 10. I get better performance if the stats are less accurate because then the optimizer seems more likely to choose the index! States that are in pg_stats.most_common_vals most often result in a Seq Scan, whereas ones that are not in it definitely get the Index Scan. For all states, even the largest ones (15% of the data), the Index Scan performs better. So, for example, with SET STATISTICS 10 my benhcmark query in a state like Indiana (2981 rows, ~3% of total) runs in 132ms. If I SET STATISTICS 100, Indiana gets on the most_common_vals list for the column and the query does a Seq Scan and its run time jumps to 977ms! If I go the other way and SET STATISTICS 1 (or 0) I can bring down the list to one entry (setting to 0 seems equivalent and still keeps the one most common entry!?) and I will get the Index scan for all states except for that one most common state. But, of course, I don't want to undermine the whole stats mechanism, I just want the system to use the index that is so helpful and brings runtimes down by a factor of 4-8! What am I missing here? George
George Pavlov wrote: >> From: Tom Lane [mailto:tgl@sss.pgh.pa.us] >> "George Pavlov" <gpavlov@mynewplace.com> writes: > to 977ms! If I go the other way and SET STATISTICS 1 (or 0) I can bring > down the list to one entry (setting to 0 seems equivalent and still > keeps the one most common entry!?) and I will get the Index scan for all > states except for that one most common state. But, of course, I don't > want to undermine the whole stats mechanism, I just want the system to > use the index that is so helpful and brings runtimes down by a factor of > 4-8! What am I missing here? In those rare cases wouldn't it make more sense to just set enable_seqscan to off; run query; set enable_seqscan to on; ?? Joshua D. Drake > > George > > ---------------------------(end of broadcast)--------------------------- > TIP 3: Have you checked our extensive FAQ? > > http://www.postgresql.org/docs/faq > -- === The PostgreSQL Company: Command Prompt, Inc. === Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240 Providing the most comprehensive PostgreSQL solutions since 1997 http://www.commandprompt.com/ Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate PostgreSQL Replication: http://www.commandprompt.com/products/
> From: Joshua D. Drake [mailto:jd@commandprompt.com] > > In those rare cases wouldn't it make more sense to just set > enable_seqscan to off; run query; set enable_seqscan to on; 1. these cases are not that rare (to me); 2. setting enable_seqscan (in JDBC, say) from the application makes the whole thing quite a mess (need to do a batch of statements: each query wrapped in its enable/disable seq scan?) -- ideally, one would like to issue mostly SQL statements, not config parameters from the application; 3. if this is the recommended suggestion on how to run queries then why don't we just add HINTS to the system and be done with it...
George Pavlov wrote:
>> From: Joshua D. Drake [mailto:jd@commandprompt.com]
>>
>> In those rare cases wouldn't it make more sense to just set
>> enable_seqscan to off; run query; set enable_seqscan to on;
>
> 1. these cases are not that rare (to me);
I find that surprising.
>
> 2. setting enable_seqscan (in JDBC, say) from the application makes the
> whole thing quite a mess (need to do a batch of statements: each query
> wrapped in its enable/disable seq scan?) -- ideally, one would like to
> issue mostly SQL statements, not config parameters from the application;
Uh no. You do it at the beginning of the transaction, run your queries
then reset it right before (or after) commit.
>
> 3. if this is the recommended suggestion on how to run queries then why
> don't we just add HINTS to the system and be done with it...
I suggest you read the archives, twice, before suggesting hints.
>
>
>
>
--
       === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive  PostgreSQL solutions since 1997
              http://www.commandprompt.com/
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/
			
		"George Pavlov" <gpavlov@mynewplace.com> writes:
>> From: Joshua D. Drake [mailto:jd@commandprompt.com]
>> In those rare cases wouldn't it make more sense to just set
>> enable_seqscan to off; run query; set enable_seqscan to on;
> 1. these cases are not that rare (to me);
It strikes me that you probably need to adjust the planner cost
parameters to reflect reality on your system.  Usually dropping
random_page_cost is the way to bias the thing more in favor of
index scans.
            regards, tom lane
			
		> From: Tom Lane > "George Pavlov" <gpavlov@mynewplace.com> writes: > >> From: Joshua D. Drake [mailto:jd@commandprompt.com] > >> In those rare cases wouldn't it make more sense to just set > >> enable_seqscan to off; run query; set enable_seqscan to on; > > > 1. these cases are not that rare (to me); > > It strikes me that you probably need to adjust the planner cost > parameters to reflect reality on your system. Usually dropping > random_page_cost is the way to bias the thing more in favor of > index scans. Thanks, Tom, I will try that. Seems better than fiddling with enable_seqscan around every query/transaction. Joshua, I fail to understand why setting and unsetting enable_seqscan on a per query/transaction basis is in any way preferable to query hints? Don't get me wrong, I don't like the idea of hints, and I have read the archives on the subject and I agree with the philosophy, but if the optimization toolkit for routine application queries is going to include setting config parameters that just smacks of hints by another name... George
George Pavlov wrote:
>> From: Tom Lane
>> "George Pavlov" <gpavlov@mynewplace.com> writes:
>>>> From: Joshua D. Drake [mailto:jd@commandprompt.com]
>>>> In those rare cases wouldn't it make more sense to just set
>>>> enable_seqscan to off; run query; set enable_seqscan to on;
>>> 1. these cases are not that rare (to me);
>> It strikes me that you probably need to adjust the planner cost
>> parameters to reflect reality on your system.  Usually dropping
>> random_page_cost is the way to bias the thing more in favor of
>> index scans.
>
> Thanks, Tom, I will try that. Seems better than fiddling with
> enable_seqscan around every query/transaction.
>
> Joshua, I fail to understand why setting and unsetting enable_seqscan on
> a per query/transaction basis is in any way preferable to query hints?
> Don't get me wrong, I don't like the idea of hints, and I have read the
> archives on the subject and I agree with the philosophy, but if the
> optimization toolkit for routine application queries is going to include
> setting config parameters that just smacks of hints by another name...
I actually have zero opinion on hints, my comment was more about opening
the wasps nest of the hints discussion more than anything :)
Joshua D. Drake
>
> George
>
--
       === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive  PostgreSQL solutions since 1997
              http://www.commandprompt.com/
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/
			
		In article <415.1181255628@sss.pgh.pa.us>, Tom Lane <tgl@sss.pgh.pa.us> wrote: % "George Pavlov" <gpavlov@mynewplace.com> writes: % >> From: Joshua D. Drake [mailto:jd@commandprompt.com] % >> In those rare cases wouldn't it make more sense to just set % >> enable_seqscan to off; run query; set enable_seqscan to on; % % > 1. these cases are not that rare (to me); % % It strikes me that you probably need to adjust the planner cost % parameters to reflect reality on your system. Usually dropping % random_page_cost is the way to bias the thing more in favor of % index scans. Also, increasing effective_cache_size. (And increasing statistics...) -- Patrick TJ McPhee North York Canada ptjm@interlog.com