Обсуждение: strange index behaviour with different statistics target

Поиск
Список
Период
Сортировка

strange index behaviour with different statistics target

От
Jeff Frost
Дата:
So, I had a query that uses a postgis geometry index and the planner was
underestimating the number of rows it would return.  Because of this,
the planner was choosing the geometry index over a compound index on the
other columns in the WHERE clause.  So, I thought, let me increase the
stats target for that geometry column.  I did, and I got a different
(and better) plan, but when I looked at the estimates for the simplified
query against the geometry column alone, I noticed that neither the cost
nor the estimated rows changed:

oitest=# explain ANALYZE  SELECT * FROM "blips" WHERE
((ST_Contains(blips.shape,
'0101000020E610000049111956F1EB55C0A8E49CD843F34440')) );

QUERY
PLAN


-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using index_blips_on_shape_gist on blips  (cost=0.00..7.33
rows=1 width=13804) (actual time=0.113..745.394 rows=2827 loops=1)
   Index Cond: (shape &&
'0101000020E610000049111956F1EB55C0A8E49CD843F34440'::geometry)
   Filter: ((shape &&
'0101000020E610000049111956F1EB55C0A8E49CD843F34440'::geometry) AND
_st_contains(shape,
'0101000020E610000049111956F1EB55C0A8E49CD843F34440'::geometry))
 Total runtime: 745.977 ms
(4 rows)

Time: 747.199 ms
oitest=# ALTER TABLE blips ALTER COLUMN shape SET statistics 1000;
ALTER TABLE
Time: 0.478 ms
oitest=# ANALYZE ;
ANALYZE
Time: 7727.097 ms
oitest=# explain ANALYZE  SELECT * FROM "blips" WHERE
((ST_Contains(blips.shape,
'0101000020E610000049111956F1EB55C0A8E49CD843F34440')) );

QUERY
PLAN


-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using index_blips_on_shape_gist on blips  (cost=0.00..7.33
rows=1 width=13761) (actual time=0.117..755.781 rows=2827 loops=1)
   Index Cond: (shape &&
'0101000020E610000049111956F1EB55C0A8E49CD843F34440'::geometry)
   Filter: ((shape &&
'0101000020E610000049111956F1EB55C0A8E49CD843F34440'::geometry) AND
_st_contains(shape,
'0101000020E610000049111956F1EB55C0A8E49CD843F34440'::geometry))
 Total runtime: 756.396 ms
(4 rows)

The width changed slightly, but the cost is 7.33 in both.

So, now I thought how could that have changed the plan?  Did the other
parts of the plan estimate change?  So I pulled the shape column out of
the where clause and left the others:

oitest=# ALTER TABLE blips ALTER COLUMN shape SET statistics 100;
ALTER TABLE
Time: 0.475 ms
oitest=# ANALYZE ;
ANALYZE
Time: 1225.325 ms
oitest=# explain ANALYZE  SELECT * FROM "blips" WHERE
(blips."content_id" = 2410268 AND blips."content_type" = E'Story');

QUERY
PLAN

------------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using index_blips_on_content_type_and_content_id on blips
(cost=0.00..9.01 rows=2 width=13924) (actual time=0.026..0.027 rows=2
loops=1)
   Index Cond: (((content_type)::text = 'Story'::text) AND (content_id =
2410268))
 Total runtime: 0.046 ms
(3 rows)

Time: 1.111 ms
oitest=# ALTER TABLE blips ALTER COLUMN shape SET statistics 1000;
ALTER TABLE
Time: 0.506 ms
oitest=# ANALYZE ;
ANALYZE
Time: 7785.496 ms
oitest=# explain ANALYZE  SELECT * FROM "blips" WHERE
(blips."content_id" = 2410268 AND blips."content_type" = E'Story');
                                                             QUERY
PLAN

-------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using index_blips_on_content_id on blips  (cost=0.00..7.29
rows=1 width=13761) (actual time=0.013..0.014 rows=2 loops=1)
   Index Cond: (content_id = 2410268)
   Filter: ((content_type)::text = 'Story'::text)
 Total runtime: 0.034 ms
(4 rows)

Time: 1.007 ms

So, my question is, should changing the stats target on the shape column
affect the stats for the content_id and content_type columns?  Also, why
does the index on content_id win out over the compound index on
(content_type, content_id)?

    "index_blips_on_content_id" btree (content_id)
    "index_blips_on_content_type_and_content_id" btree (content_type,
content_id)

--
Jeff Frost, Owner     <jeff@frostconsultingllc.com>
Frost Consulting, LLC     http://www.frostconsultingllc.com/
Phone: 916-647-6411    FAX: 916-405-4032


Re: strange index behaviour with different statistics target

От
Tom Lane
Дата:
Jeff Frost <jeff@frostconsultingllc.com> writes:
> So, my question is, should changing the stats target on the shape column
> affect the stats for the content_id and content_type columns?

It would change the size of the sample for the table, which might
improve the accuracy of the stats.  IIRC you'd still get the same number
of histogram entries and most-common-values for the other columns, but
they might be more accurate.

> Also, why does the index on content_id win out over the compound index
> on (content_type, content_id)?

It's deciding (apparently correctly, from the explain results) that the
larger index isn't increasing the selectivity enough to be worth its
extra search cost.  I suppose content_type = 'Story' isn't very
selective in this table?

            regards, tom lane

Re: strange index behaviour with different statistics target

От
Jeff Frost
Дата:
On Tue, 13 Jan 2009, Tom Lane wrote:

> Jeff Frost <jeff@frostconsultingllc.com> writes:
>> So, my question is, should changing the stats target on the shape column
>> affect the stats for the content_id and content_type columns?
>
> It would change the size of the sample for the table, which might
> improve the accuracy of the stats.  IIRC you'd still get the same number
> of histogram entries and most-common-values for the other columns, but
> they might be more accurate.

Why would they be more accurate?  Do they somehow correlate with the other
column's histogram and most-common-values when the stats target is increased
on that column?

The planner is choosing a plan I like for the query, I'm just trying to
understand why it's doing that since the planner thinks the gist index is
going to give it a single row (vs the 2827 rows it actually gets) and the fact
that the cost didn't change for perusing the gist index.  I guess I was
expecting the estimated rowcount and cost for perusing the gist index to go up
and when it didn't I was pleasantly surprised to find I got a plan I wanted
anyway.

>
>> Also, why does the index on content_id win out over the compound index
>> on (content_type, content_id)?
>
> It's deciding (apparently correctly, from the explain results) that the
> larger index isn't increasing the selectivity enough to be worth its
> extra search cost.  I suppose content_type = 'Story' isn't very
> selective in this table?

Ah!  You're right, especially with this content_id!

--
Jeff Frost, Owner     <jeff@frostconsultingllc.com>
Frost Consulting, LLC     http://www.frostconsultingllc.com/
Phone: 916-647-6411    FAX: 916-405-4032

Re: strange index behaviour with different statistics target

От
Tom Lane
Дата:
Jeff Frost <jeff@frostconsultingllc.com> writes:
> On Tue, 13 Jan 2009, Tom Lane wrote:
>> It would change the size of the sample for the table, which might
>> improve the accuracy of the stats.  IIRC you'd still get the same number
>> of histogram entries and most-common-values for the other columns, but
>> they might be more accurate.

> Why would they be more accurate?

They'd be drawn from a larger sample of the table rows.  If we need a
random sample of N rows for the largest stats target among the columns,
we use all those rows for deriving the stats for the other columns too.

> The planner is choosing a plan I like for the query, I'm just trying to
> understand why it's doing that since the planner thinks the gist index is
> going to give it a single row (vs the 2827 rows it actually gets) and the fact
> that the cost didn't change for perusing the gist index.

You'd need to ask the postgis guys whether they have an estimator for
ST_Contains that actually does anything useful.  I haven't the foggiest
what the state of their stats support is.

[ looks again at the plan... ]  Actually it looks like the estimator
for && is what's at issue.  Estimators are attached to operators not
functions.

            regards, tom lane

Re: strange index behaviour with different statistics target

От
Jeff Frost
Дата:
On Tue, 13 Jan 2009, Tom Lane wrote:

> Jeff Frost <jeff@frostconsultingllc.com> writes:
>> On Tue, 13 Jan 2009, Tom Lane wrote:
>>> It would change the size of the sample for the table, which might
>>> improve the accuracy of the stats.  IIRC you'd still get the same number
>>> of histogram entries and most-common-values for the other columns, but
>>> they might be more accurate.
>
>> Why would they be more accurate?
>
> They'd be drawn from a larger sample of the table rows.  If we need a
> random sample of N rows for the largest stats target among the columns,
> we use all those rows for deriving the stats for the other columns too.

Oh, ok, thanks Tom.  That makes sense now.

>
>> The planner is choosing a plan I like for the query, I'm just trying to
>> understand why it's doing that since the planner thinks the gist index is
>> going to give it a single row (vs the 2827 rows it actually gets) and the fact
>> that the cost didn't change for perusing the gist index.
>
> You'd need to ask the postgis guys whether they have an estimator for
> ST_Contains that actually does anything useful.  I haven't the foggiest
> what the state of their stats support is.
>
> [ looks again at the plan... ]  Actually it looks like the estimator
> for && is what's at issue.  Estimators are attached to operators not
> functions.

Thanks, I'll see if I can dig up some info on that and/or post to the
postgis list if I can't turn anything up.

--
Jeff Frost, Owner     <jeff@frostconsultingllc.com>
Frost Consulting, LLC     http://www.frostconsultingllc.com/
Phone: 916-647-6411    FAX: 916-405-4032