Обсуждение: Expression index ignores column statistics target
I've noticed that row count estimates for expression indexes appear
to rely on default_statistics_target rather than on a column's
actual statistics target. That is, if I use ALTER TABLE SET
STATISTICS to increase a column's statistics target and then run
ANALYZE, then estimates for non-expression-index queries improve
as expected. However, queries that use an expression index remain
accurate for only around the N most common values, where N is the
default_statistics_target that was in effect when ANALYZE ran. I'm
still rummaging through the archives looking for past discussion;
is this behavior a known limitation or just an oversight?
CREATE TABLE foo (x integer);
CREATE INDEX foo_x_idx ON foo (x);
CREATE INDEX foo_abs_x_idx ON foo (abs(x));
INSERT INTO foo (x) SELECT r1 % r2 FROM generate_series(1, 100) AS g1(r1), generate_series(1, 100) AS g2(r2);
SET default_statistics_target TO 15;
ALTER TABLE foo ALTER COLUMN x SET STATISTICS 20;
ANALYZE foo;
SELECT most_common_vals FROM pg_stats WHERE attname = 'x'; most_common_vals
-----------------------------------------------------{0,1,2,3,4,5,6,7,8,10,9,11,12,14,13,15,16,19,17,18}
(1 row)
EXPLAIN ANALYZE SELECT * FROM foo WHERE x = 13; QUERY PLAN
----------------------------------------------------------------------------------------------------------------------Bitmap
HeapScan on foo (cost=2.72..50.28 rows=205 width=4) (actual time=0.370..1.766 rows=220 loops=1) Recheck Cond: (x =
13) -> Bitmap Index Scan on foo_x_idx (cost=0.00..2.72 rows=205 width=0) (actual time=0.314..0.314 rows=220 loops=1)
Index Cond: (x = 13)Total runtime: 2.905 ms
(5 rows)
EXPLAIN ANALYZE SELECT * FROM foo WHERE abs(x) = 13; QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------Bitmap
HeapScan on foo (cost=2.72..50.80 rows=205 width=4) (actual time=0.358..1.720 rows=220 loops=1) Recheck Cond: (abs(x)
=13) -> Bitmap Index Scan on foo_abs_x_idx (cost=0.00..2.72 rows=205 width=0) (actual time=0.305..0.305 rows=220
loops=1) Index Cond: (abs(x) = 13)Total runtime: 2.875 ms
(5 rows)
EXPLAIN ANALYZE SELECT * FROM foo WHERE x = 18; QUERY PLAN
----------------------------------------------------------------------------------------------------------------------Bitmap
HeapScan on foo (cost=2.60..49.75 rows=172 width=4) (actual time=0.312..1.442 rows=180 loops=1) Recheck Cond: (x =
18) -> Bitmap Index Scan on foo_x_idx (cost=0.00..2.60 rows=172 width=0) (actual time=0.262..0.262 rows=180 loops=1)
Index Cond: (x = 18)Total runtime: 2.393 ms
(5 rows)
EXPLAIN ANALYZE SELECT * FROM foo WHERE abs(x) = 18; QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------Bitmap
HeapScan on foo (cost=2.22..43.65 rows=63 width=4) (actual time=0.313..1.436 rows=180 loops=1) Recheck Cond: (abs(x)
=18) -> Bitmap Index Scan on foo_abs_x_idx (cost=0.00..2.22 rows=63 width=0) (actual time=0.263..0.263 rows=180
loops=1) Index Cond: (abs(x) = 18)Total runtime: 2.418 ms
(5 rows)
--
Michael Fuhr
This is expected. The main TODO items is:* Allow accurate statistics to be collected on indexes with more than one
columnor expression indexes, perhaps using per-index statistics
Basically, we don't have multi-column or expression statistics. ANALYZE
just analyzes columns, even if an expression index exists.
---------------------------------------------------------------------------
Michael Fuhr wrote:
> I've noticed that row count estimates for expression indexes appear
> to rely on default_statistics_target rather than on a column's
> actual statistics target. That is, if I use ALTER TABLE SET
> STATISTICS to increase a column's statistics target and then run
> ANALYZE, then estimates for non-expression-index queries improve
> as expected. However, queries that use an expression index remain
> accurate for only around the N most common values, where N is the
> default_statistics_target that was in effect when ANALYZE ran. I'm
> still rummaging through the archives looking for past discussion;
> is this behavior a known limitation or just an oversight?
>
> CREATE TABLE foo (x integer);
>
> CREATE INDEX foo_x_idx ON foo (x);
> CREATE INDEX foo_abs_x_idx ON foo (abs(x));
>
> INSERT INTO foo (x)
> SELECT r1 % r2
> FROM generate_series(1, 100) AS g1(r1),
> generate_series(1, 100) AS g2(r2);
>
> SET default_statistics_target TO 15;
> ALTER TABLE foo ALTER COLUMN x SET STATISTICS 20;
> ANALYZE foo;
>
> SELECT most_common_vals FROM pg_stats WHERE attname = 'x';
> most_common_vals
> -----------------------------------------------------
> {0,1,2,3,4,5,6,7,8,10,9,11,12,14,13,15,16,19,17,18}
> (1 row)
>
> EXPLAIN ANALYZE SELECT * FROM foo WHERE x = 13;
> QUERY PLAN
>
----------------------------------------------------------------------------------------------------------------------
> Bitmap Heap Scan on foo (cost=2.72..50.28 rows=205 width=4) (actual time=0.370..1.766 rows=220 loops=1)
> Recheck Cond: (x = 13)
> -> Bitmap Index Scan on foo_x_idx (cost=0.00..2.72 rows=205 width=0) (actual time=0.314..0.314 rows=220
loops=1)
> Index Cond: (x = 13)
> Total runtime: 2.905 ms
> (5 rows)
>
> EXPLAIN ANALYZE SELECT * FROM foo WHERE abs(x) = 13;
> QUERY PLAN
>
--------------------------------------------------------------------------------------------------------------------------
> Bitmap Heap Scan on foo (cost=2.72..50.80 rows=205 width=4) (actual time=0.358..1.720 rows=220 loops=1)
> Recheck Cond: (abs(x) = 13)
> -> Bitmap Index Scan on foo_abs_x_idx (cost=0.00..2.72 rows=205 width=0) (actual time=0.305..0.305 rows=220
loops=1)
> Index Cond: (abs(x) = 13)
> Total runtime: 2.875 ms
> (5 rows)
>
> EXPLAIN ANALYZE SELECT * FROM foo WHERE x = 18;
> QUERY PLAN
>
----------------------------------------------------------------------------------------------------------------------
> Bitmap Heap Scan on foo (cost=2.60..49.75 rows=172 width=4) (actual time=0.312..1.442 rows=180 loops=1)
> Recheck Cond: (x = 18)
> -> Bitmap Index Scan on foo_x_idx (cost=0.00..2.60 rows=172 width=0) (actual time=0.262..0.262 rows=180
loops=1)
> Index Cond: (x = 18)
> Total runtime: 2.393 ms
> (5 rows)
>
> EXPLAIN ANALYZE SELECT * FROM foo WHERE abs(x) = 18;
> QUERY PLAN
>
-------------------------------------------------------------------------------------------------------------------------
> Bitmap Heap Scan on foo (cost=2.22..43.65 rows=63 width=4) (actual time=0.313..1.436 rows=180 loops=1)
> Recheck Cond: (abs(x) = 18)
> -> Bitmap Index Scan on foo_abs_x_idx (cost=0.00..2.22 rows=63 width=0) (actual time=0.263..0.263 rows=180
loops=1)
> Index Cond: (abs(x) = 18)
> Total runtime: 2.418 ms
> (5 rows)
>
> --
> Michael Fuhr
>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: Don't 'kill -9' the postmaster
>
-- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610)
359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square,
Pennsylvania19073
Michael Fuhr <mike@fuhr.org> writes:
> I've noticed that row count estimates for expression indexes appear
> to rely on default_statistics_target rather than on a column's
> actual statistics target. That is, if I use ALTER TABLE SET
> STATISTICS to increase a column's statistics target and then run
> ANALYZE, then estimates for non-expression-index queries improve
> as expected. However, queries that use an expression index remain
> accurate for only around the N most common values, where N is the
> default_statistics_target that was in effect when ANALYZE ran.
The code does in fact honor per-column statistics targets attached to
expression indexes, viz
alter table myfuncindex alter column pg_expression_1 set statistics 100;
This isn't documented, mainly because pg_dump doesn't promise to dump
such things, which it doesn't do because I didn't want to see the
"pg_expression_N" naming for expression index columns become graven on
stone tablets. I seem to recall bringing up the question of whether
we could find a less implementation-specific way of commanding this
behavior, but I can't find it in the archives right now.
regards, tom lane
On Fri, Sep 30, 2005 at 11:59:26PM -0400, Bruce Momjian wrote: > This is expected. The main TODO items is: > > * Allow accurate statistics to be collected on indexes with more than > one column or expression indexes, perhaps using per-index statistics > > Basically, we don't have multi-column or expression statistics. ANALYZE > just analyzes columns, even if an expression index exists. But the row count estimates imply that expression index queries do use column statistics, presumably as a proxy in the absence of expression statistics. This looks like a relevant commit: http://archives.postgresql.org/pgsql-committers/2004-02/msg00124.php The behavior I observed is that the planner does appear to use column statistics when planning an expression index query, but it doesn't appear to honor a column's non-default statistics target. In other words: * Row count estimates for expression index queries (at least simple ones) are reasonably accurate for the N most common columnvalues, where N is the value of default_statistics_target when ANALYZE was run. * Specifically setting the column's statistics target with ALTER TABLE SET STATISTICS doesn't result in better statisticsfor expression index queries. That difference in behavior seems odd: if default_statistics_target has an effect, why doesn't ALTER TABLE SET STATISTICS? -- Michael Fuhr
On Sat, Oct 01, 2005 at 12:53:03AM -0400, Tom Lane wrote: > The code does in fact honor per-column statistics targets attached to > expression indexes, viz > > alter table myfuncindex alter column pg_expression_1 set statistics 100; Aha -- that's the piece I didn't know about. I was wondering where those statistics were being stored, since they were affected by default_statistics_target but not by per-column statistics targets. And now I see them when I don't restrict queries against pg_stats by just the table or column name. Thanks. -- Michael Fuhr
I wrote: > I seem to recall bringing up the question of whether > we could find a less implementation-specific way of commanding this > behavior, but I can't find it in the archives right now. Ah, here it is: http://archives.postgresql.org/pgsql-hackers/2004-03/msg00502.php No responses :-( regards, tom lane
On Sat, Oct 01, 2005 at 02:19:06AM -0400, Tom Lane wrote: > I wrote: > > I seem to recall bringing up the question of whether > > we could find a less implementation-specific way of commanding this > > behavior, but I can't find it in the archives right now. > > Ah, here it is: > http://archives.postgresql.org/pgsql-hackers/2004-03/msg00502.php > > No responses :-( Would an ALTER INDEX SET STATISTICS form be possible? -- Michael Fuhr
Michael Fuhr <mike@fuhr.org> writes: > On Sat, Oct 01, 2005 at 02:19:06AM -0400, Tom Lane wrote: >> Ah, here it is: >> http://archives.postgresql.org/pgsql-hackers/2004-03/msg00502.php > Would an ALTER INDEX SET STATISTICS form be possible? It's not so much the table/index misnomer that's bothering me, it's the lack of a clean way to identify which column of the index you are talking about. regards, tom lane
On Sat, Oct 01, 2005 at 02:42:32AM -0400, Tom Lane wrote: > Michael Fuhr <mike@fuhr.org> writes: > > Would an ALTER INDEX SET STATISTICS form be possible? > > It's not so much the table/index misnomer that's bothering me, it's > the lack of a clean way to identify which column of the index you > are talking about. Ah, I see -- I wasn't thinking about expressions in multicolumn indexes. What about identifying the column with the expression itself, ala quote_ident(pg_get_indexdef())? That might be tedious for the user to type but it would be attractive from a self-documentation standpoint. ALTER INDEX indexname ALTER COLUMN "the expression" SET STATISTICS 100; I do see that indexes allow multiple instances of the same expression, so this approach could be ambiguous. Or should such repetition be prohibited as it is with column names? test=> CREATE TABLE foo (x integer); CREATE TABLE test=> CREATE INDEX foo1_idx ON foo (x, x); ERROR: duplicate key violates unique constraint "pg_attribute_relid_attnam_index" test=> CREATE INDEX foo2_idx ON foo (abs(x), abs(x)); CREATE INDEX -- Michael Fuhr
Tom Lane wrote: > Michael Fuhr <mike@fuhr.org> writes: > > I've noticed that row count estimates for expression indexes appear > > to rely on default_statistics_target rather than on a column's > > actual statistics target. That is, if I use ALTER TABLE SET > > STATISTICS to increase a column's statistics target and then run > > ANALYZE, then estimates for non-expression-index queries improve > > as expected. However, queries that use an expression index remain > > accurate for only around the N most common values, where N is the > > default_statistics_target that was in effect when ANALYZE ran. > > The code does in fact honor per-column statistics targets attached to > expression indexes, viz > > alter table myfuncindex alter column pg_expression_1 set statistics 100; > > This isn't documented, mainly because pg_dump doesn't promise to dump > such things, which it doesn't do because I didn't want to see the > "pg_expression_N" naming for expression index columns become graven on > stone tablets. I seem to recall bringing up the question of whether > we could find a less implementation-specific way of commanding this > behavior, but I can't find it in the archives right now. Is this a TODO? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Michael Fuhr <mike@fuhr.org> writes:
> ALTER INDEX indexname ALTER COLUMN "the expression" SET STATISTICS 100;
Yeah, that could probably be made to work.
> I do see that indexes allow multiple instances of the same expression,
> so this approach could be ambiguous.
I can't think of an actual use for that, though, so we could just ignore
the possible ambiguity. Or we could have the ALTER update all columns
matching the given expression.
> test=> CREATE INDEX foo1_idx ON foo (x, x);
> ERROR: duplicate key violates unique constraint "pg_attribute_relid_attnam_index"
Hmm, seems like there should be a more direct check for this ...
regards, tom lane