Обсуждение: BUG #8598: Row count estimates of partial indexes
The following bug has been logged on the website: Bug reference: 8598 Logged by: Marko Tiikkaja Email address: marko@joh.to PostgreSQL version: 9.1.9 Operating system: Linux Description: Hi, We have the following partial index on a small subset of a larger table: "index_transactions_transaction_balance_details" btree (transactionid) WHERE NOT processed AND accountbalancesdailyid IS NOT NULL However, querying with the WHERE clause completely ignores the pg_class.reltuples value for the index: =# EXPLAIN ANALYZE SELECT * FROM transactions WHERE NOT processed AND accountbalancesdailyid IS NOT NULL; QUERY PLAN --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Index Scan using index_transactions_transaction_balance_details on transactions (cost=0.00..3883160.47 rows=66259403 width=130) (actual time=0.033..18.268 rows=13962 loops=1) Total runtime: 18.874 ms (2 rows) .. which makes for some silly joins when this index is part of a larger query. Is this expected on 9.1? Has this been fixed in more recent versions?
"marko@joh.to" <marko@joh.to> wrote:=0A=0A>=A0=A0 "index_transactions_trans= action_balance_details" btree=0A> (transactionid)=0A> WHERE NOT processed A= ND accountbalancesdailyid IS NOT NULL=0A>=0A> However, querying with the WH= ERE clause completely ignores the=0A> pg_class.reltuples value for the inde= x:=0A=0AWhat is the pg_class.reltuples value for the index?=0A=0A> =3D# EXP= LAIN ANALYZE SELECT * FROM transactions WHERE NOT processed=0A> AND account= balancesdailyid IS NOT NULL;=0A=0A> Index Scan using index_transactions_tra= nsaction_balance_details=0A> on transactions=A0 (cost=3D0.00..3883160.47 ro= ws=3D66259403 width=3D130)=0A> (actual time=3D0.033..18.268 rows=3D13962 lo= ops=3D1)=0A=0A> .. which makes for some silly joins when this index is part= of a=0A> larger query.=0A>=0A> Is this expected on 9.1?=A0 Has this been f= ixed in more recent=0A> versions?=0A=0APlease provide a little more informa= tion:=0A=0ASELECT version();=0ASELECT name, current_setting(name), source= =0A=A0 FROM pg_settings=0A=A0 WHERE source NOT IN ('default', 'override');= =0A=0AWhat VACUUM or ANALYZE commands are run outside of autovacuum?=0A=0A-= -=0AKevin Grittner=0AEDB: http://www.enterprisedb.com=0AThe Enterprise Post= greSQL Company
marko@joh.to writes: > We have the following partial index on a small subset of a larger table: > "index_transactions_transaction_balance_details" btree (transactionid) > WHERE NOT processed AND accountbalancesdailyid IS NOT NULL > However, querying with the WHERE clause completely ignores the > pg_class.reltuples value for the index: Yup. Row count estimates are derived by estimating the selectivity of the given WHERE clauses and multiplying by the (estimated) current table size. In the particular case you show here, with a partial index that *exactly* matches the WHERE clause, we could get a better answer by looking at the index size --- but that doesn't scale to any less simplistic case, such as a query with additional WHERE clauses. It's also important to realize that reltuples for an index is a whole lot less trustworthy than it is for a table; ANALYZE doesn't update the former, for example. And scaling from the last-reported VACUUM stats to current reality is going to be shakier. So on the whole, I don't think this would be a good idea. regards, tom lane
On 11/17/13, 5:29 PM, Tom Lane wrote: > marko@joh.to writes: >> We have the following partial index on a small subset of a larger table: >> "index_transactions_transaction_balance_details" btree (transactionid) >> WHERE NOT processed AND accountbalancesdailyid IS NOT NULL >> However, querying with the WHERE clause completely ignores the >> pg_class.reltuples value for the index: > > Yup. Row count estimates are derived by estimating the selectivity of the > given WHERE clauses and multiplying by the (estimated) current table size. > In the particular case you show here, with a partial index that *exactly* > matches the WHERE clause, we could get a better answer by looking at the > index size --- but that doesn't scale to any less simplistic case, such > as a query with additional WHERE clauses. > > It's also important to realize that reltuples for an index is a whole lot > less trustworthy than it is for a table; ANALYZE doesn't update the > former, for example. And scaling from the last-reported VACUUM stats > to current reality is going to be shakier. > > So on the whole, I don't think this would be a good idea. Any suggestions for a workaround? When reading this index as a part of a bigger query the horrible estimate ensures that nobody's having fun. I currently have something like: SELECT * FROM (SELECT * FROM transactions WHERE <partial index> LIMIT 25000) transactions JOIN .. And I *really* don't like that as a workaround. Regards, Marko Tiikkaja
Marko Tiikkaja <marko@joh.to> writes: > Any suggestions for a workaround? When reading this index as a part of > a bigger query the horrible estimate ensures that nobody's having fun. Why is the estimate so bad? I suppose the answer is that those two columns are very strongly correlated. Maybe you could refactor your data representation to avoid that? The long-term answer as far as Postgres is concerned is to learn about cross-column correlations, but that's not happening in the near future. regards, tom lane
On Sun, Nov 17, 2013 at 11:55 AM, Marko Tiikkaja <marko@joh.to> wrote: > On 11/17/13, 5:29 PM, Tom Lane wrote: > >> marko@joh.to writes: >> >>> We have the following partial index on a small subset of a larger table: >>> "index_transactions_transaction_balance_details" btree >>> (transactionid) >>> WHERE NOT processed AND accountbalancesdailyid IS NOT NULL >>> However, querying with the WHERE clause completely ignores the >>> pg_class.reltuples value for the index: >>> >> >> Yup. Row count estimates are derived by estimating the selectivity of the >> given WHERE clauses and multiplying by the (estimated) current table size. >> In the particular case you show here, with a partial index that *exactly* >> matches the WHERE clause, we could get a better answer by looking at the >> index size --- but that doesn't scale to any less simplistic case, such >> as a query with additional WHERE clauses. >> >> It's also important to realize that reltuples for an index is a whole lot >> less trustworthy than it is for a table; ANALYZE doesn't update the >> former, for example. And scaling from the last-reported VACUUM stats >> to current reality is going to be shakier. >> >> So on the whole, I don't think this would be a good idea. >> > > Any suggestions for a workaround? When reading this index as a part of a > bigger query the horrible estimate ensures that nobody's having fun. I > currently have something like: > Define a new column which is true iff the where condition is true? It sounds like that one magic combination has a meaning all of its own, so it would make sense to encode it in one column. > > SELECT * FROM > (SELECT * FROM transactions WHERE <partial index> LIMIT 25000) > transactions > JOIN .. > > And I *really* don't like that as a workaround. > I've wanted a function that always returns true, but which the planner things returns false most of the time, for use in such situations. It looks like you can make one of these with a compiled module (by creating an operator and then wrapping that in a function), but I have not found a way to do it without using C. (CREATE FUNCTION takes a COST and ROWS, but not a SELECTIVITY.) Cheers, Jeff
On 11/17/13 9:18 PM, Tom Lane wrote: > Marko Tiikkaja <marko@joh.to> writes: >> Any suggestions for a workaround? When reading this index as a part of >> a bigger query the horrible estimate ensures that nobody's having fun. > > Why is the estimate so bad? I suppose the answer is that those two > columns are very strongly correlated. Maybe you could refactor your > data representation to avoid that? I'll look into that, thanks. > The long-term answer as far as Postgres is concerned is to learn about > cross-column correlations, but that's not happening in the near future. I'm completely clueless about how the planner works, but wouldn't it be easier to have some kind of separate stats for the conditions in partial indexes? It seems better in all cases than trying infer the stats from cross-column correlations, even if we had that. Regards, Marko Tiikkaja
Marko Tiikkaja <marko@joh.to> writes: > On 11/17/13 9:18 PM, Tom Lane wrote: >> The long-term answer as far as Postgres is concerned is to learn about >> cross-column correlations, but that's not happening in the near future. > I'm completely clueless about how the planner works, but wouldn't it be > easier to have some kind of separate stats for the conditions in partial > indexes? It seems better in all cases than trying infer the stats from > cross-column correlations, even if we had that. There's been some discussion of providing a way to hint to ANALYZE about which combinations of columns are worth gathering cross-column statistics for. But partial index predicates seem like a pretty bad mechanism for that. regards, tom lane
On Mon, Nov 18, 2013 at 11:08 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Marko Tiikkaja <marko@joh.to> writes: >> On 11/17/13 9:18 PM, Tom Lane wrote: >>> The long-term answer as far as Postgres is concerned is to learn about >>> cross-column correlations, but that's not happening in the near future. > >> I'm completely clueless about how the planner works, but wouldn't it be >> easier to have some kind of separate stats for the conditions in partial >> indexes? It seems better in all cases than trying infer the stats from >> cross-column correlations, even if we had that. > > There's been some discussion of providing a way to hint to ANALYZE about > which combinations of columns are worth gathering cross-column statistics > for. But partial index predicates seem like a pretty bad mechanism for > that. > > regards, tom lane Why? If there's a partial index on some predicate, it does mean the predicate is of common occurence or at least important and it's quite expectable that more precise estimations regarding those queries valuable. Analyze should simply record the selectivity of partial index predicates as it would the MFV of the boolean variable equal to the predicate's result, and modifying the MFV estimation code to look up for those specific stats doesn't seem too difficult.