Discussion: Postgres optimizer choosing wrong index
I'm using postgres 7.4 and having a problem with the query optimizer.

Our table, T, looks like this:

    dh int
    fh int
    nm int
    ...  -- other columns

A typical row is 400-500 bytes. T has two indexes, idx_df on (dh, fh) and
idx_dn on (dh, nm). My query is:

    select * from T where dh = ? and fh = ?

If I run EXPLAIN on this query (plugging in values 1 and 2 for the
variables) before VACUUM ANALYZE, I get the desired execution plan:

    Index Scan using idx_df on T  (cost=0.00..4.83 rows=1 width=454)
      Index Cond: ((dh = 1) AND (fh = 2))

But after VACUUM ANALYZE:

    Index Scan using idx_dn on T  (cost=0.00..5.27 rows=1 width=561)
      Index Cond: (dh = 1)
      Filter: (fh = 2)

Notice that postgres is now using the other index. This behavior is
somewhat dependent on the values plugged in. I ran a query to count dh
values (external_file, dir_hash, and fn_hash are the real names behind T,
dh, and fh):

    select dir_hash, count(*) from external_file group by dir_hash;

      dir_hash  | count
    ------------+--------
      916645488 |  20000
     1057692240 | 200000

And if I use 1057692240 in the EXPLAIN, I get the desired plan:

    Index Scan using idx_df on external_file  (cost=0.00..5.27 rows=1 width=561)
      Index Cond: ((dir_hash = 1057692240) AND (fn_hash = 2))

I've tried playing with various cost settings (e.g. random_page_cost) but
have been unable to influence the optimizer's behavior. Rewriting the
query as "... where (dh, fh) = (?, ?)" doesn't help.

So a few questions:

- Why would the optimizer ever choose idx_dn over idx_df, given that
  idx_df has to be more selective?
- Is there any way to force the use of idx_df?

Jack Orenstein

P.S. Yes, I know, 7.4. We're upgrading to 8.3, but we have this problem
right now.
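As a diagnostic sketch (using the simplified names above, and assuming the table has been ANALYZEd while non-empty), the per-column estimates the planner bases its selectivity reasoning on can be inspected in the pg_stats view:

```sql
-- Inspect the planner's per-column statistics for T.
-- n_distinct is what makes fh effectively more selective than dh;
-- pg_stats has no rows for a table until ANALYZE runs on real data.
SELECT attname, n_distinct, null_frac
FROM pg_stats
WHERE tablename = 't'
  AND attname IN ('dh', 'fh', 'nm');
```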
Jack Orenstein <jack.orenstein@hds.com> writes:
> If I run EXPLAIN on this query (plugging in values 1 and 2 for the
> variables) before VACUUM ANALYZE, I get the desired execution plan:

> Index Scan using idx_df on T  (cost=0.00..4.83 rows=1 width=454)
>   Index Cond: ((dh = 1) AND (fh = 2))

> But after VACUUM ANALYZE:

> Index Scan using idx_dn on T  (cost=0.00..5.27 rows=1 width=561)
>   Index Cond: (dh = 1)
>   Filter: (fh = 2)

> Notice that postgres is now using the other index. This behavior is
> somewhat dependent on the values plugged in. I ran a query to count dh
> values:

> select dir_hash, count(*) from external_file group by dir_hash;

>   dir_hash  | count
> ------------+--------
>   916645488 |  20000
>  1057692240 | 200000

So you're plugging in a value that doesn't appear in the index, and
Postgres knows it?

I think that in this example the estimated costs are going to be exactly
the same either way, so it's kind of a tossup which index gets chosen ---
and it's not actually going to matter any at runtime either. Descending
the btree to find that there's no matching entry is going to take just
about the same amount of time in either index. If you plug in a value
that *does* occur in the table, it should probably choose the
more-relevant index consistently.

			regards, tom lane
Tom Lane wrote:
> Jack Orenstein <jack.orenstein@hds.com> writes:
>> If I run EXPLAIN on this query (plugging in values 1 and 2 for the
>> variables) before VACUUM ANALYZE, I get the desired execution plan:
>
>> Index Scan using idx_df on T  (cost=0.00..4.83 rows=1 width=454)
>>   Index Cond: ((dh = 1) AND (fh = 2))
>
>> But after VACUUM ANALYZE:
>
>> Index Scan using idx_dn on T  (cost=0.00..5.27 rows=1 width=561)
>>   Index Cond: (dh = 1)
>>   Filter: (fh = 2)
>
>> Notice that postgres is now using the other index.
>
> So you're plugging in a value that doesn't appear in the index, and
> Postgres knows it?
>
> I think that in this example, the estimated costs are going to be
> exactly the same either way, so it's kind of a tossup which index
> gets chosen --- and it's not actually going to matter any at runtime
> either. Descending the btree to find that there's no matching entry
> is going to take just about the same amount of time in either index.
> If you plug in a value that *does* occur in the table it should probably
> choose the more-relevant index consistently.

Unfortunately, it matters a lot at runtime. The dh value is not very
selective, as shown by the statistics above. But the fh value is highly
selective. The idx_df index makes use of this selectivity; the idx_dn
index does not.

In fact, the reason I noticed this problem was horrendous runtime
performance, sometimes. I started poking around and noticed the VACUUM
ANALYZE dependence.

Given that the estimated costs are about the same, I'm wondering why
postgres doesn't go with the index that can do more work in the "Index
Cond" part of the plan.

Do you know how I can force use of the idx_df index?

Jack
Jack Orenstein <jack.orenstein@hds.com> writes:
> Tom Lane wrote:
>> If you plug in a value that *does* occur in the table it should probably
>> choose the more-relevant index consistently.

> Unfortunately, it matters a lot at runtime. The dh value is not very
> selective, as shown by the statistics above.

A dh value that does not occur in the index is *perfectly* selective.
I'm not sure what your problem is, but this example isn't illustrating
anything wrong that I can see.

			regards, tom lane
Researching this some more, it appears to be the case that VACUUM (by
itself, no ANALYZE) is changing the optimizer's behavior. Here is a
self-contained test:

    select '*** drop t';
    drop table t cascade;
    select '*** create t(dh, fh, nm, filler)';
    create table t (dh int, fh int, nm int, filler char(500));
    select '*** create index on (dh, fh)';
    create index idx_df on t(dh, fh);
    select '*** create index on (dh, nm)';
    create index idx_dn on t(dh, nm);
    select '*** explain select * from t where dh = 1 and fh = 2';
    explain select * from t where dh = 1 and fh = 2;
    select '*** vacuum t (no analyze)';
    vacuum t;
    select '*** explain select * from t where dh = 1 and fh = 2';
    explain select * from t where dh = 1 and fh = 2;

This output was produced by 7.4.8. Version 8.3.4 uses the "wrong"
execution plan both before and after the vacuum.

Jack
Tom Lane wrote:
> Jack Orenstein <jack.orenstein@hds.com> writes:
>> Tom Lane wrote:
>>> If you plug in a value that *does* occur in the table it should probably
>>> choose the more-relevant index consistently.
>
>> Unfortunately, it matters a lot at runtime. The dh value is not very
>> selective, as shown by the statistics above.
>
> A dh value that does not occur in the index is *perfectly* selective.
> I'm not sure what your problem is but this example isn't illustrating
> anything wrong that I can see.

I see your point. I may have simplified too far.

Our application runs a number of different queries. All our WHERE
clauses restrict dh and fh. For a given pair of (dh, fh) values, the
initial query should come up empty; we then insert this pair, and then
there is further processing (SELECT, UPDATE). Something is causing a
huge number of index row reads (according to pg_stat_user_indexes), but
only in tables that have been vacuumed.

Jack
Jack Orenstein <jack.orenstein@hds.com> writes:
> I may have simplified too far. Our application runs a number of
> different queries. All our WHERE clauses restrict dh and fh. For a
> given pair of (dh, fh) values, the initial query should come up empty
> and then insert this pair, and then there is further processing
> (SELECT, UPDATE). Something is causing a huge number of index row
> reads (according to pg_stat_user_indexes) but only in tables that have
> been vacuumed.

Well, that's a bit more concrete, but it's still difficult to tell where
the problem is. Are you by any chance creating new tables and then
vacuuming them while they're still empty? That would cause
pg_class.relpages to get set to zero, and 7.4.x is not bright enough to
change plans when you later fill the table (until you do another vacuum
or analyze). However, I think 7.4 would always choose a seqscan on a
table it thinks is zero-size, so I'm not sure that that's what's going
on here.

			regards, tom lane
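The empty-table effect described above can be checked directly; a minimal sketch, using a hypothetical throwaway table t_demo:

```sql
-- Vacuuming a table while it is still empty records relpages = 0,
-- and 7.4 keeps trusting that figure after the table is loaded.
CREATE TABLE t_demo (dh int, fh int);
VACUUM ANALYZE t_demo;     -- table is empty at this point
SELECT relpages, reltuples
FROM pg_class
WHERE relname = 't_demo';
-- relpages = 0 and reltuples = 0 here; with a zero-size table,
-- every candidate index scan gets essentially the same estimated cost.
```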
Tom Lane wrote:
> Jack Orenstein <jack.orenstein@hds.com> writes:
>> I may have simplified too far. Our application runs a number of
>> different queries. All our WHERE clauses restrict dh and fh. For a
>> given pair of (dh, fh) values, the initial query should come up empty
>> and then insert this pair, and then there is further processing
>> (SELECT, UPDATE). Something is causing a huge number of index row
>> reads (according to pg_stat_user_indexes) but only in tables that have
>> been vacuumed.
>
> Well, that's a bit more concrete but it's still difficult to tell where
> the problem is. Are you by any chance creating new tables and then
> vacuuming them while they're still empty? That would cause
> pg_class.relpages to get set to zero, and 7.4.x is not bright enough to
> change plans when you later fill the table (until you do another vacuum
> or analyze). However, I think 7.4 would always choose a seqscan on a
> table it thinks is zero-size, so I'm not sure that that's what's going
> on here.

I've collected much more detailed information on this problem.

(Recap: In 7.4, table T(int dh, int fh, int nm, ...) has indexes idx_df
on (dh, fh) and idx_dn on (dh, nm). Queries of the form "... where
dh = ? and fh = ?" appear to be using the (dh, nm) index, leading to
horrendous performance in cases where there are very few unique dh
values and T has been vacuumed.)

The optimizer is definitely picking the wrong index, even when the query
restricts dh to a value actually present in the table. (Yesterday, I was
misled by running EXPLAIN with a value not in the table.) Tom's
speculation above is mostly correct, except that, as shown below, the
optimizer is not choosing a seqscan.

EXPERIMENT:

- I created two schemas, NOVAC and VAC, each with a table T as described
  above.
- Before loading data, I ran VACUUM ANALYZE on VAC.T.
- I then started loading data. The workload is a mixture of INSERT,
  SELECT and UPDATE. For SELECT and UPDATE the WHERE clause always
  includes "dh = ? and fh = ?".

PG_STAT RESULTS:

Stats collection was enabled. Here is sample output during the run (yes,
those are real numbers; I just happened to run the query when the 2nd
row counters hit those nice round numbers):

    ris=# select schemaname, indexrelname, idx_scan, idx_tup_read, idx_tup_fetch
          from pg_stat_user_indexes
          where schemaname in ('novac', 'vac') and relname = 'external_file'
          order by 1, 2;
     schemaname | indexrelname | idx_scan | idx_tup_read | idx_tup_fetch
    ------------+--------------+----------+--------------+---------------
     novac      | idx_dn       |        0 |            0 |             0
     novac      | idx_dh       |    25000 |        20000 |         20000
     vac        | idx_dn       |    16093 |     25991728 |      25991728
     vac        | idx_dh       |        0 |            0 |             0

This shows the different index usage in the two schemas. The schema
containing the vacuumed table used idx_dn exclusively; in the other
schema, idx_dh was used exclusively.

EXPLAIN PLAN RESULTS:

This query shows the distribution of T.dh values in NOVAC:

    ris=# select dh, count(*) from novac.t group by dh;
        dh     | count
    -----------+-------
     280433162 |  5000
     601454890 |  5000

Using one of these dh values, the optimizer selects idx_df as I expected
(the fh values are nearly unique; 0 is the median of the possible
values):

    ris=# explain select * from novac.t where dh = 280433162 and fh = 0;
                               QUERY PLAN
    ------------------------------------------------------------------
     Index Scan using idx_df on t  (cost=0.00..4.83 rows=1 width=454)
       Index Cond: ((dh = 280433162) AND (fh = 0))
    (2 rows)

But in the VAC schema:

    ris=# select dh, count(*) from vac.t group by dh;
        dh     | count
    -----------+-------
     758082190 |  5000
     980351022 |  5000
    (2 rows)

    ris=# explain select * from vac.t where dh = 758082190 and fh = 0;
                               QUERY PLAN
    ------------------------------------------------------------------
     Index Scan using idx_dn on t  (cost=0.00..3.68 rows=1 width=454)
       Index Cond: (dh = 758082190)
       Filter: (fh = 0)
    (3 rows)

Here are page and tuple counts:

    ris=# select n.nspname, relname, relpages, reltuples
          from pg_class, pg_namespace n
          where relname = 't' and relnamespace = n.oid;
     nspname | relname | relpages | reltuples
    ---------+---------+----------+-----------
     novac   | t       |       10 |      1000
     vac     | t       |        0 |         0

From Tom Lane's earlier email, the vacuum prior to the load explains the
0s in the second row.

pg_stats has no data for table T in either schema. In the case of the
VAC schema, I'm guessing this is because the table was actually empty
when analyzed.

CONCLUSION:

The optimizer is objectively making the wrong choice. The pg_stat
results included above (as well as the block-level counts) show that the
idx_df index should have been used. While the dh part of the index is
not highly selective, the fh part is. And with very high ratios of fh to
dh values (as in my experiment), it matters a lot. The transaction rates
measured by the test program make this clear even without looking at
pg_stat numbers.

Admittedly, the optimizer doesn't have much to go on. But I would think
that with a WHERE clause of the form "dh = ? and fh = ?", choosing the
(dh, fh) index over any other index would be a reasonable heuristic.

Jack
Jack Orenstein <jack.orenstein@hds.com> writes:
> - I created two schemas, NOVAC and VAC, each with a table T as
>   described above.

> - Before loading data, I ran VACUUM ANALYZE on VAC.T.

> - I then started loading data. The workload is a mixture of INSERT,
>   SELECT and UPDATE. For SELECT and UPDATE the WHERE clause always
>   includes "dh = ? and fh = ?".

Basically your problem here is that vacuum records the size of the table
as zero (in pg_class.relpages/reltuples), and that causes the computed
costs of the two indexscans to be exactly the same, so it's a tossup
which one gets used. (In recent versions I think the index with higher
OID would typically get chosen in a tie, but I forget if 7.4 worked that
way.)

8.0 and up are smart enough not to believe pg_class.relpages anymore
after you've loaded a lot of data, but 7.4 isn't. In testing similar
cases here, I get reasonable cost estimates and a sane plan choice from
7.4 so long as the stats are up to date.

Bottom line: you need to vacuum (or preferably analyze) *after*
initially populating a table, not before.

			regards, tom lane
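The recommended ordering, sketched against the simplified schema from earlier in the thread:

```sql
-- Load first, then gather statistics, so pg_class.relpages/reltuples
-- reflect the table's real size when the planner costs the indexes.
CREATE TABLE t (dh int, fh int, nm int, filler char(500));
CREATE INDEX idx_df ON t (dh, fh);
CREATE INDEX idx_dn ON t (dh, nm);
-- ... bulk load the data here ...
VACUUM ANALYZE t;
-- Sanity check: the planner should now see a non-empty table.
SELECT relpages, reltuples FROM pg_class WHERE relname = 't';
```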
Tom Lane wrote:
> Jack Orenstein <jack.orenstein@hds.com> writes:
>> - I created two schemas, NOVAC and VAC, each with a table T as
>>   described above.
>
>> - Before loading data, I ran VACUUM ANALYZE on VAC.T.
>
>> - I then started loading data. The workload is a mixture of INSERT,
>>   SELECT and UPDATE. For SELECT and UPDATE the WHERE clause always
>>   includes "dh = ? and fh = ?".
>
> Basically your problem here is that vacuum records the size of the table
> as zero (in pg_class.relpages/reltuples) and that causes the computed
> costs of the two indexscans to be exactly the same, so it's a tossup
> which one gets used.
>
> Bottom line: you need to vacuum (or preferably analyze) *after*
> initially populating a table, not before.

OK, I've added this behavior to my application. As the table is being
loaded, I run VACUUM ANALYZE every 500 inserts, until we get to size
10,000. I know this is working because of application-level logging, and
because I see relpages and reltuples go up. EXPLAIN says that the
correct index is being used -- it didn't used to.

However, pg_stat_user_indexes says otherwise. In my test, I have exactly
one dh value. Running EXPLAIN with this value produces a plan using
idx_dh (the correct index), but pg_stat_user_indexes says that idx_dn is
being used (see psql session below).

This eventually works itself out. Eventually, the counters for idx_dh
start going up, showing that that index is eventually being used. But
this discrepancy between EXPLAIN and actual query execution is making
life very difficult.
Is the discrepancy between EXPLAIN and pg_stat_user_indexes due to some
sort of caching per connection? E.g., is a connection that uses one plan
for a query stuck with that plan for that query?

What would really be nice is a logging option that reported the
execution plan actually used for a query.

Jack

    ris=# \d t
              Table "vac.t"
     Column | Type    | Modifiers
    --------+---------+-----------
     dh     | integer | not null
     fh     | integer | not null
     nm     | bigint  |
     ...
    Indexes:
        "idx_dn" btree (dh, nm)
        "idx_dh" btree (dh, fh)

    ris=# select dh, count(*) from t group by dh;
        dh     | count
    -----------+-------
     589849733 | 19890
    (1 row)

    ris=# explain select * from t where dh = 589849733 and fh = 0;
                               QUERY PLAN
    ------------------------------------------------------------------
     Index Scan using idx_dh on t  (cost=0.00..5.26 rows=2 width=570)
       Index Cond: ((dh = 589849733) AND (fh = 0))
    (2 rows)

    ris=# select schemaname, indexrelname, idx_scan, idx_tup_read, idx_tup_fetch
          from pg_stat_user_indexes
          where schemaname = 'vac' and relname = 't'
          order by 1, 2;
     schemaname | indexrelname | idx_scan | idx_tup_read | idx_tup_fetch
    ------------+--------------+----------+--------------+---------------
     vac        | idx_dn       |    31315 |    122773990 |     122773990
     vac        | idx_dh       |        0 |            0 |             0
Jack Orenstein <jack.orenstein@hds.com> writes:
> EXPLAIN says that the correct index is being used -- it didn't used
> to. However, pg_stat_user_indexes says otherwise. In my test, I have
> exactly one dh value. Running EXPLAIN with this value produces a plan
> using idx_dh (the correct index), but pg_stat_user_indexes says that
> idx_dn is being used (see psql session below).

Yeah, if you are using cached plans (via PREPARE or plpgsql functions),
then the plan stays the same for the life of the session ... pre-8.3,
that is.

			regards, tom lane
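A short sketch of the caching behavior described here (the statement name q is arbitrary, and the table is the simplified t from earlier):

```sql
-- Pre-8.3, a prepared statement's plan is chosen once, at PREPARE time,
-- and reused for the life of the session.
PREPARE q(int, int) AS SELECT * FROM t WHERE dh = $1 AND fh = $2;
EXPLAIN EXECUTE q(1, 2);   -- shows the cached plan, not a freshly planned one
-- Re-preparing after statistics change is the way to get a new plan:
DEALLOCATE q;
PREPARE q(int, int) AS SELECT * FROM t WHERE dh = $1 AND fh = $2;
```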
Tom Lane wrote:
> Jack Orenstein <jack.orenstein@hds.com> writes:
>> EXPLAIN says that the correct index is being used -- it didn't used
>> to. However, pg_stat_user_indexes says otherwise. In my test, I have
>> exactly one dh value. Running EXPLAIN with this value produces a plan
>> using idx_dh (the correct index), but pg_stat_user_indexes says that
>> idx_dn is being used (see psql session below).
>
> Yeah, if you are using cached plans (via PREPARE or plpgsql functions)
> then the plan stays the same for the life of the session ... pre 8.3
> that is.

Thank you. One of the great things about postgres is being able to
understand things all the way down, with the patient help of the actual
developers.

Jack