Обсуждение: Index of a table is not used (in any case)
Hello PostgreSQl Users! PostSQL V 7.1.1: I have defined a table and the necessary indices. But the index is not used in every SELECT. (Therefore, the selects are *very* slow, due to seq scan on 20 million entries, which is a test setup up to now) The definitions can be seen in the annex. Does some body know the reason and how to circumvent the seq scan? Is the order of index creation relevant? I.e., should I create the indices before inserting entries or the other way around? Should a hashing index be used? (I tried this, but I got the known error "Out of overflow pages") (The docu on "create index" says : "Notes The Postgres query optimizer will consider using a btree index whenever an indexed attribute is involved in a comparison using one of: <, <=, =, >=, > The Postgres query optimizer will consider using an rtree index whenever an indexed attribute is involved in a comparison using one of: <<, &<, &>, >>, @, ~=, && The Postgres query optimizer will consider using a hash index whenever an indexed attribute is involved in a comparison using the = operator. " The table entry 'epoche' is used in two different indices. Should that be avoided? Any suggestions are welcome. Thank you in advance. Reiner ------------------------------ Annex: ====== Table: ------ \d wetter Table "wetter"Attribute | Type | Modifier -----------+--------------------------+----------sensor_id | integer | not nullepoche | timestamp withtime zone | not nullwert | real | not null Indices: wetter_epoche_idx, wetter_pkey \d wetter_epoche_idx Index "wetter_epoche_idx"Attribute | Type -----------+--------------------------epoche | timestamp with time zone btree \d wetter_pkey Index "wetter_pkey"Attribute | Type -----------+--------------------------sensor_id | integerepoche | timestamp with time zone unique btree (primary key) Select where index is used: ============================ explain select * from wetter order by epoche desc; NOTICE: QUERY PLAN: Index Scan Backward using wetter_epoche_idx on wetter (cost=0.00..3216018.59 rows=20340000 width=16) EXPLAIN Select where the index is NOT used: =================================== explain select * from wetter where epoche between '1970-01-01' and '1980-01-01' order by epoche asc; NOTICE: QUERY PLAN: Sort (cost=480705.74..480705.74 rows=203400 width=16) -> Seq Scan on wetter (cost=0.00..454852.00 rows=203400 width=16) EXPLAIN -- Mit freundlichen Gruessen / With best regards Reiner Dassing
> Hello PostgreSQl Users! > > PostSQL V 7.1.1: You should upgrade to 7.1.3 at some point... > I have defined a table and the necessary indices. > But the index is not used in every SELECT. (Therefore, the selects are > *very* slow, due to seq scan on > 20 million entries, which is a test setup up to now) > > The definitions can be seen in the annex. > > Does some body know the reason and how to circumvent the seq scan? Yes. You probably have not run 'VACUUM ANALYZE' on your large table. > Is the order of index creation relevant? I.e., should I create the > indices before inserting > entries or the other way around? If you are inserting a great many entries, insert the data first and then create the indices - it will be much faster this way. > Should a hashing index be used? (I tried this, but I got the known error > "Out of overflow pages") Just do the default CREATE INDEX - btree should be fine... (probably) > The table entry 'epoche' is used in two different indices. Should that > be avoided? It's not a problem, but just check your EXPLAIN output after the VACUUM to check that you have them right. Chris
Reiner Dassing <dassing@wettzell.ifag.de> writes: > Hello PostgreSQl Users! > > PostSQL V 7.1.1: > > I have defined a table and the necessary indices. > But the index is not used in every SELECT. (Therefore, the selects are > *very* slow, due to seq scan on > 20 million entries, which is a test setup up to now) Perennial first question: did you VACUUM ANALYZE? -Doug -- Let us cross over the river, and rest under the shade of the trees. --T. J. Jackson, 1863
Doug McNaught wrote: > Reiner Dassing <dassing@wettzell.ifag.de> writes: > > > Hello PostgreSQl Users! > > > > PostSQL V 7.1.1: > > > > I have defined a table and the necessary indices. > > But the index is not used in every SELECT. (Therefore, the selects are > > *very* slow, due to seq scan on > > 20 million entries, which is a test setup up to now) > > Perennial first question: did you VACUUM ANALYZE? Can there, or could there, be a notion of "rule based" optimization of queries in PostgreSQL? The "not using index" problem is probably the most common and most misunderstood problem.
Reiner Dassing <dassing@wettzell.ifag.de> writes:
> explain select * from wetter order by epoche desc;    
> NOTICE:  QUERY PLAN:
> Index Scan Backward using wetter_epoche_idx on wetter 
> (cost=0.00..3216018.59 rows=20340000 width=16)
> explain select * from wetter where epoche between '1970-01-01' and
> '1980-01-01' order by epoche asc;
> NOTICE:  QUERY PLAN:
> Sort  (cost=480705.74..480705.74 rows=203400 width=16)
>   ->  Seq Scan on wetter  (cost=0.00..454852.00 rows=203400 width=16)
It's hard to believe that you've done a VACUUM ANALYZE on this table,
since you are getting a selectivity estimate of exactly 0.01, which
just happens to be the default selectivity estimate for range queries.
How many rows are there really in this date range?
Anyway, the reason the planner is picking a seqscan+sort is that it
thinks that will be faster than an indexscan.  It's not necessarily
wrong.  Have you compared the explain output and actual timings both
ways?  (Use "set enable_seqscan to off" to force it to pick an indexscan
for testing purposes.)
        regards, tom lane
			
		> > > *very* slow, due to seq scan on > > > 20 million entries, which is a test setup up to now) > > > > Perennial first question: did you VACUUM ANALYZE? > > Can there, or could there, be a notion of "rule based" optimization of > queries in PostgreSQL? The "not using index" problem is probably the most > common and most misunderstood problem. There is a (sort of) rule based behavior in PostgreSQL, the down side of the current implementation is, that certain other commands than ANALYZE (e.g. "create index") partly update optimizer statistics. This is bad behavior, since then only part of the statistics are accurate. Statistics always have to be seen in context to other table's and other index'es statistics. Thus, currently the rule based optimizer only works if you create the indexes on empty tables (before loading data), which obviously has downsides. Else you have no choice but to ANALYZE frequently. I have tried hard to fight for this pseudo rule based behavior, but was only partly successful in convincing core. My opinion is, that (unless runtime statistics are kept) no other command than ANALYZE should be allowed to touch optimizer relevant statistics (maybe unless explicitly told to). Andreas
Zeugswetter Andreas SB SD wrote: > > > > > *very* slow, due to seq scan on > > > > 20 million entries, which is a test setup up to now) > > > > > > Perennial first question: did you VACUUM ANALYZE? > > > > Can there, or could there, be a notion of "rule based" optimization of > > queries in PostgreSQL? The "not using index" problem is probably the > most > > common and most misunderstood problem. > > There is a (sort of) rule based behavior in PostgreSQL, > the down side of the current implementation is, that certain > other commands than ANALYZE (e.g. "create index") partly update > optimizer statistics. This is bad behavior, since then only part > of the statistics are accurate. Statistics always have to be seen > in context to other table's and other index'es statistics. > > Thus, currently the rule based optimizer only works if you create > the indexes on empty tables (before loading data), which obviously > has downsides. Else you have no choice but to ANALYZE frequently. > > I have tried hard to fight for this pseudo rule based behavior, > but was only partly successful in convincing core. My opinion is, > that (unless runtime statistics are kept) no other command than > ANALYZE should be allowed to touch optimizer relevant statistics > (maybe unless explicitly told to). Perhaps there could be an extension to ANALYZE, i.e. ANALYZE RULEBASED tablename that would restore or recalculate the state that a table would be if all indexes were created from scratch? The "not using index" was very frustrating to understand. The stock answer, "did you vacuum?" just isn't enough. There has to be some explanation (in the FAQ or something) about the indexed key distribution in your data. Postgres' statistics are pretty poor too, a relative few very populous entries in a table will make it virtually impossible for the cost based optimizer (CBO) to use an index. At my site we have lots of tables that have many duplicate items in an index. It is a music based site and has a huge amount of "Various Artists" entries. No matter what we do, there is NO way to get Postgres to use the index from the query alone. We have over 20 thousand artists, but 5 "Various Artists" or "Soundtrack" entries change the statistics so much that they exclude an index scan. We have to run the system with sequential scan disabled. Running with seq disabled eliminates the usefulness of the CBO because when it is a justified table scan, it does an index scan. I have approached this windmill before and a bit regretful at bringing it up again, but it is important, very important. There needs to be a way to direct the optimizer about how to optimize the query. Using "set foo=bar" prior to a query is not acceptable. Web sites use persistent connections to the databases and since "set" can not be restored, you override global settings for the session, or have to code, in the web page, the proper default setting. The result is either that different web processes will behave differently depending on the order in which they execute queries, or you have to have your DBA write web pages. A syntax like: select * from table where /* enable_seqscan = false */ key = 'value'; Would be great in that you could tune the optimizer as long as the settings were for the clause directly following the directive, without affecting the state of the session or transaction. For instance: select id from t1, t2 where /* enable_seqscan = false */ t1.key = 'value' and t2.key = 'test' and t1.id = t2.id; The where "t1.key = 'value'" condition would be prohibited from using a sequntial scan, while the "t2.key = 'test'" would use it if it made sense. Is this possible?
mlw <markw@mohawksoft.com> writes:
> ... Postgres' statistics are pretty poor too, a relative few very
> populous entries in a table will make it virtually impossible for the
> cost based optimizer (CBO) to use an index.
Have you looked at development sources lately?
        regards, tom lane
			
		mlw writes: > The "not using index" was very frustrating to understand. The stock answer, > "did you vacuum?" just isn't enough. There has to be some explanation (in the > FAQ or something) about the indexed key distribution in your data. Most "not using index" questions seem to be related to a misunderstanding of users to the effect that "if there is an index it must be used, not matter what the query", which is of course far from reality. Add to that the (related) category of inquiries from people that think the index ought to be used but don't have any actual timings to show, you have a lot of people that just need to be educated. Of course the question "did you vacuum" (better, did you analyze) is annoying, just as the requirement to analyze is annoying in the first place, but unless someone designs a better query planner it will have to do. The reason why we always ask that question first is that people invariantly have not analyzed. A seasoned developer can often tell from the EXPLAIN output whether ANALYZE has been done, but users cannot. Perhaps something can be done in this area, but I'm not exactly sure what. -- Peter Eisentraut peter_e@gmx.net http://funkturm.homeip.net/~peter
> Of course the question "did you vacuum" (better, did you analyze) is > annoying, just as the requirement to analyze is annoying in the first > place, but unless someone designs a better query planner it > will have to do. The reason why we always ask that question first is > that people invariantly have not analyzed. I think it is also not allways useful to ANALYZE. There are applications that choose optimal plans with only the rudimentary statistics VACUUM creates. And even such that use optimal plans with only the default statistics in place. Imho one of the biggest sources for problems is people creating new indexes on populated tables when the rest of the db/table has badly outdated statistics or even only default statistics in place. In this situation the optimizer is badly misguided, because it now sees completely inconsistent statistics to work on. (e.g. old indexes on that table may seem way too cheap compared to table scan) I would thus propose a more distinguished approach of writing the statistics gathered during "create index" to the system tables. Something like: if (default stats in place) write defaults else if (this is the only index) write gathered statistics else write only normalized statistics for index (e.g. index.reltuples = table.reltuples; index.relpages= (index.gathered.relpages * table.relpages / table.gathered.relpages) Andreas
Hello Tom! Tom Lane wrote: > > Reiner Dassing <dassing@wettzell.ifag.de> writes: > > explain select * from wetter order by epoche desc; > > NOTICE: QUERY PLAN: > > > Index Scan Backward using wetter_epoche_idx on wetter > > (cost=0.00..3216018.59 rows=20340000 width=16) > > > explain select * from wetter where epoche between '1970-01-01' and > > '1980-01-01' order by epoche asc; > > NOTICE: QUERY PLAN: > > > Sort (cost=480705.74..480705.74 rows=203400 width=16) > > -> Seq Scan on wetter (cost=0.00..454852.00 rows=203400 width=16) > > It's hard to believe that you've done a VACUUM ANALYZE on this table, > since you are getting a selectivity estimate of exactly 0.01, which > just happens to be the default selectivity estimate for range queries. > How many rows are there really in this date range? > Well, I did not claim that i made a VACUUM ANALYZE, I just set up a new table for testing purposes doing just INSERTs. After VACUUM ANALYSE the results look like: explain select * from wetter where epoche between '1970-01-01' and test_wetter-# '1980-01-01' order by epoche asc; NOTICE: QUERY PLAN: Index Scan using wetter_epoche_idx on wetter (cost=0.00..3313780.74 rows=20319660 width=16) EXPLAIN Now, the INDEX Scan is used and therefore, the query is very fast, as expected. For me, as a user not being involved in all the intrinsics of PostgreSQL, the question was "Why is this SELECT so slow?" (this question is asked a lot of times in this Mail lists) Now, I would like to say thank you! You have explained me and hopefully many more users what is going on behind the scene. > Anyway, the reason the planner is picking a seqscan+sort is that it > thinks that will be faster than an indexscan. It's not necessarily > wrong. Have you compared the explain output and actual timings both > ways? (Use "set enable_seqscan to off" to force it to pick an indexscan > for testing purposes.) > > regards, tom lane -- Mit freundlichen Gruessen / With best regards Reiner Dassing
"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:
> Imho one of the biggest sources for problems is people creating new
> indexes on populated tables when the rest of the db/table has badly
> outdated statistics or even only default statistics in place.
> In this situation the optimizer is badly misguided, because it now
> sees completely inconsistent statistics to work on.
> (e.g. old indexes on that table may seem way too cheap compared 
> to table scan) 
I don't think any of this is correct.  We don't have per-index
statistics.  The only stats updated by CREATE INDEX are the same ones
updated by plain VACUUM, viz the number-of-tuples and number-of-pages
counts in pg_class.  I believe it's reasonable to update those stats
more often than the pg_statistic stats (in fact, if we could keep them
constantly up-to-date at a reasonable cost, we'd do so).  The
pg_statistic stats are designed as much as possible to be independent
of the absolute number of rows in the table, so that it's okay if they
are out of sync with the pg_class stats.
The major reason why "you vacuumed but you never analyzed" is such a
killer is that in the absence of any pg_statistic data, the default
selectivity estimates are such that you may get either an index or seq
scan depending on how big the table is.  The cost estimates are
nonlinear (correctly so, IMHO, though I wouldn't necessarily defend the
exact shape of the curve) and ye olde default 0.01 will give you an
indexscan for a small table but not for a big one.  In 7.2 I have
reduced the default selectivity estimate to 0.005, for a number of
reasons but mostly to get it out of the range where the decision will
flip-flop.  Observe:
test71=# create table foo (f1 int);
CREATE
test71=# create index fooi on foo(f1);
CREATE
test71=# explain select * from foo where f1 = 42;
NOTICE:  QUERY PLAN:
Index Scan using fooi on foo  (cost=0.00..8.14 rows=10 width=4)
EXPLAIN
test71=# select reltuples,relpages from pg_class where relname = 'foo';reltuples | relpages
-----------+----------     1000 |       10
(1 row)
EXPLAIN
test71=# update pg_class set reltuples = 100000, relpages = 1000 where relname = 'foo';
UPDATE 1
test71=# explain select * from foo where f1 = 42;
NOTICE:  QUERY PLAN:
Index Scan using fooi on foo  (cost=0.00..1399.04 rows=1000 width=4)
EXPLAIN
test71=# update pg_class set reltuples = 1000000, relpages = 10000 where relname = 'foo';
UPDATE 1
test71=# explain select * from foo where f1 = 42;
NOTICE:  QUERY PLAN:
Seq Scan on foo  (cost=0.00..22500.00 rows=10000 width=4)
EXPLAIN
test71=#
In current sources you keep getting an indexscan as you increase the
number of tuples...
        regards, tom lane
			
		Tom Lane writes: > "Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes: > > Imho one of the biggest sources for problems is people creating new > > indexes on populated tables when the rest of the db/table has badly > > outdated statistics or even only default statistics in place. > > In this situation the optimizer is badly misguided, because it now > > sees completely inconsistent statistics to work on. > > (e.g. old indexes on that table may seem way too cheap compared > > to table scan) > > I don't think any of this is correct. We don't have per-index > statistics. The only stats updated by CREATE INDEX are the same ones > updated by plain VACUUM, viz the number-of-tuples and number-of-pages > counts in pg_class. 1. Have I said anything about other stats, than relpages and reltuples ? 2. There is only limited use in the most accurate pg_statistics if reltuples and relpages is completely off. In the current behavior you eg get: rel1: pages = 100000 -- updated from "create index" index1 pages = 2 -- outdated index2 pages = 2000 -- current rel2: pages = 1 -- outdated --> Optimizer will prefer join order: rel2, rel1 > I believe it's reasonable to update those stats > more often than the pg_statistic stats (in fact, if we could keep them > constantly up-to-date at a reasonable cost, we'd do so). There is a whole lot of difference between keeping them constantly up to date and modifying (part of) them in the "create index" command, so I do not counter your above sentence, but imho the conclusion is wrong. > The > pg_statistic stats are designed as much as possible to be independent > of the absolute number of rows in the table, so that it's okay if they > are out of sync with the pg_class stats. Independently, they can only be good for choosing whether to use an index or seq scan. They are not sufficient to choose a good join order. > The major reason why "you vacuumed but you never analyzed" is such a > killer is that in the absence of any pg_statistic data, the default > selectivity estimates are such that you may get either an index or seq > scan depending on how big the table is. The cost estimates are > nonlinear (correctly so, IMHO, though I wouldn't necessarily > defend the > exact shape of the curve) and ye olde default 0.01 will give you an > indexscan for a small table but not for a big one. In 7.2 I have > reduced the default selectivity estimate to 0.005, for a number of > reasons but mostly to get it out of the range where the decision will > flip-flop. Yes, the new selectivity is better, imho even still too high. Imho the strategy should be to assume a good selectivity of values in absence of pg_statistics evidence. If the index was not selective enough for an average query, the dba should not have created the index in the first place. > test71=# create table foo (f1 int); > test71=# create index fooi on foo(f1); > test71=# explain select * from foo where f1 = 42; > Index Scan using fooi on foo (cost=0.00..8.14 rows=10 width=4) > test71=# update pg_class set reltuples = 100000, relpages = > 1000 where relname = 'foo'; > Index Scan using fooi on foo (cost=0.00..1399.04 rows=1000 width=4) > test71=# update pg_class set reltuples = 1000000, relpages = > 10000 where relname = 'foo'; > Seq Scan on foo (cost=0.00..22500.00 rows=10000 width=4) > In current sources you keep getting an indexscan as you increase the > number of tuples... As you can see it toppeled at 10 Mio rows :-( Andreas