Обсуждение: GIN indexscans versus equality selectivity estimation
Whilst fooling around with GIN over the past few days, I noticed the following rather surprising behavior: regression=# create table t1 (f1 int[]); CREATE TABLE regression=# insert into t1 values (array[42]); INSERT 0 1 regression=# create index ti1 on t1 using gin (f1); CREATE INDEX regression=# set enable_seqscan to 0; SET regression=# explain select * from t1 where f1 = array[42]; QUERY PLAN -----------------------------------------------------------------------Seq Scan on t1 (cost=10000000000.00..10000000001.01rows=1 width=32) Filter: (f1 = '{42}'::integer[]) (2 rows) The system absolutely will not use the index in this state, even though the GIN array opclass does support equality. I dug around and eventually figured out why: 1. We don't have any pg_stats statistics in this state; but we do know reltuples = 1, since the CREATE INDEX helpfully updated the pg_class row with that information. Without stats, eqsel() falls back to a selectivity estimate of 1/numdistinct; and without stats, get_variable_numdistinct estimates the number of distinct values as equal to reltuples, if that's a small number. The upshot is that we get a selectivity estimate of exactly 1.0 for the equality condition. 2. In indxpath.c, we have the following comment and code about selecting which paths to generate bitmap scan paths for: * ... Also, pick out the ones that might be useful * as bitmap scans. For that, we must discard indexes that don'tsupport * bitmap scans, and we also are only interested in paths that have some * selectivity; we should discardanything that was generated solely for * ordering purposes. if (ipath->indexinfo->amhasgetbitmap && ipath->indexselectivity < 1.0 && !ScanDirectionIsBackward(ipath->indexscandir)) bitindexpaths = lappend(bitindexpaths, ipath); Since GIN doesn't support plain indexscans, only bitmap scans, this means that the planner simply won't use a GIN index unless the estimated selectivity is less than 1.0. There are several things we might choose to do about this: 1. Do nothing. The issue seems quite unlikely to affect anyone in the field, since in fact "use a seqscan" is probably the right answer anytime reltuples = 1; and anyway using a GIN index for plain equality is a corner case to begin with. However, it could confuse people who were doing testing (it confused me!). 2. Hack the selectivity estimators so that we don't get exactly 1.0 for this sort of situation. It might be plausible for get_variable_numdistinct to set a floor of 2 on its result, for example; or we could hack eqsel() to bound the no-stats estimate to a bit less than 1. 3. Change the indxpath.c code to have some different way of determining whether a path is sensible to use as a bitmap scan. We could for instance make the test be "path has some indexquals and/or index is partial". Or maybe there should be an exception in the case where the index type doesn't support plain indexscan. I'm not really sold on any of these alternatives. Thoughts? regards, tom lane
On 1/9/11 3:38 PM, Tom Lane wrote: > 1. Do nothing. The issue seems quite unlikely to affect anyone in the > field, since in fact "use a seqscan" is probably the right answer > anytime reltuples = 1; and anyway using a GIN index for plain equality > is a corner case to begin with. However, it could confuse people who > were doing testing (it confused me!). +1. It's an unexpected result, but not actually a bad one. It just doesn't seem worth messing with code which works in production just to help testing. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
On Sun, Jan 9, 2011 at 6:38 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > or we could hack eqsel() to bound the no-stats estimate to a bit less > than 1. This seems like a pretty sensible thing to do. I can't immediately imagine a situation in which 1.0 is a sensible selectivity estimate in the no-stats case and 0.90 (say) is a major regression. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Sun, Jan 9, 2011 at 6:38 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> or we could hack eqsel() to bound the no-stats estimate to a bit less >> than 1. > This seems like a pretty sensible thing to do. I can't immediately > imagine a situation in which 1.0 is a sensible selectivity estimate in > the no-stats case and 0.90 (say) is a major regression. After sleeping on it, that seems like my least favorite option. It's basically a kluge, as is obvious because there's no principled way to choose what the bound is (or the minimum result from get_variable_numdistinct, if we were to hack it there). I'm currently leaning to the idea of tweaking the logic in indxpath.c; in particular, why wouldn't it be a good idea to force consideration of the bitmap path if the index type hasn't got amgettuple? If we don't, then we've completely wasted the effort spent up to that point inside find_usable_indexes. Or we could just ignore the issue; as Josh says, that's not an unreasonable option. The particular case I ran into is certainly not too compelling. I'm a bit worried though that there might be other cases where the estimator comes up with 1.0 selectivity but it'd still be worth considering a bitmap scan. regards, tom lane
On 1/10/11 7:25 AM, Tom Lane wrote: > I'm a bit worried though that there might be other > cases where the estimator comes up with 1.0 selectivity but it'd still > be worth considering a bitmap scan. Well, I think the answer is to apply the other fixes, and test. If there are other cases of selectivity=1.0, they'll show up. People are pretty fast to complain if indexes aren't used, and we have a good production test case available once you implement the other operators. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
Josh Berkus <josh@agliodbs.com> writes: > On 1/10/11 7:25 AM, Tom Lane wrote: >> I'm a bit worried though that there might be other >> cases where the estimator comes up with 1.0 selectivity but it'd still >> be worth considering a bitmap scan. > Well, I think the answer is to apply the other fixes, and test. If > there are other cases of selectivity=1.0, they'll show up. People are > pretty fast to complain if indexes aren't used, and we have a good > production test case available once you implement the other operators. "Implement the other operators"? I don't think we're on the same page here. What I'm talking about is a one-line change in indxpath.c to not short-circuit consideration of a bitmap indexscan. regards, tom lane
On Mon, Jan 10, 2011 at 10:25 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Sun, Jan 9, 2011 at 6:38 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> or we could hack eqsel() to bound the no-stats estimate to a bit less >>> than 1. > >> This seems like a pretty sensible thing to do. I can't immediately >> imagine a situation in which 1.0 is a sensible selectivity estimate in >> the no-stats case and 0.90 (say) is a major regression. > > After sleeping on it, that seems like my least favorite option. It's > basically a kluge, as is obvious because there's no principled way to > choose what the bound is (or the minimum result from > get_variable_numdistinct, if we were to hack it there). Well, the general problem is that we have no reasonable way of handling planning uncertainty. We have no way of throwing our hands up in the air and saying "I really have no clue how many rows are going to come out of that node"; as far as the rest of the planning process is concerned, a selectivity estimate of 0.005 based on <column> = <some MCV with a frequency of 0.005> is exactly identical to one that results from a completely inscrutable equality condition. So while I agree with you that there's no particular principled way to choose the exact value, that doesn't strike me as a compelling argument against fixing some value. ISTM that selectivity estimates of exactly 0 and exactly 1 ought to be viewed with a healthy dose of suspicion. > I'm currently > leaning to the idea of tweaking the logic in indxpath.c; in particular, > why wouldn't it be a good idea to force consideration of the bitmap path > if the index type hasn't got amgettuple? If we don't, then we've > completely wasted the effort spent up to that point inside > find_usable_indexes. I guess the obvious question is: why wouldn't it be a good idea to force consideration of the bitmap path even if the index type DOES have amgettuple? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Mon, Jan 10, 2011 at 10:25 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I'm currently >> leaning to the idea of tweaking the logic in indxpath.c; in particular, >> why wouldn't it be a good idea to force consideration of the bitmap path >> if the index type hasn't got amgettuple? �If we don't, then we've >> completely wasted the effort spent up to that point inside >> find_usable_indexes. > I guess the obvious question is: why wouldn't it be a good idea to > force consideration of the bitmap path even if the index type DOES > have amgettuple? Well, the motivation is what the code comment said: not to waste time uselessly considering the bitmap form of an indexscan whose only reason to live was to produce output sorted in a particular way. That's irrelevant for GIN of course, but it's entirely relevant for btree. It might be just useless over-optimization, but I don't think so -- choose_bitmap_and is O(N^2) in the number of paths submitted to it, so adding a lot of uninteresting paths doesn't seem smart. A small variant of the approach would be to only reject paths that have non-empty pathkeys. That's not a *sufficient* condition, because a path could have both pathkeys and good selectivity --- but it could be added onto the selectivity test. regards, tom lane