Re: Selectivity of "=" (Re: [HACKERS] Index not used on simple se lect)
From | Bruce Momjian |
---|---|
Subject | Re: Selectivity of "=" (Re: [HACKERS] Index not used on simple se lect) |
Date | |
Msg-id | 199907290021.UAA24604@candle.pha.pa.us |
In reply to | Re: Selectivity of "=" (Re: [HACKERS] Index not used on simple se lect) (Tom Lane <tgl@sss.pgh.pa.us>) |
Replies | Re: Selectivity of "=" (Re: [HACKERS] Index not used on simple se lect) (Philip Warner <pjw@rhyme.com.au>); Re: Selectivity of "=" (Re: [HACKERS] Index not used on simple se lect) (Tom Lane <tgl@sss.pgh.pa.us>) |
List | pgsql-hackers |
> Bruce Momjian <maillist@candle.pha.pa.us> writes:
> >> BTW, this argument proves rigorously that the selectivity of a search
> >> for any value other than the MFOV is not more than 0.5, so there is some
> >> basis for my intuition that eqsel should not return a value above 0.5.
> >> So, in the cases where eqsel does not know the exact value being
> >> searched for, I'd still be inclined to cap its result at 0.5.
>
> > I don't follow this. If the most frequent value occurs 95% of the time,
> > wouldn't the selectivity be 0.95?
>
> If you are searching for the most frequent value, then the selectivity
> estimate should indeed be 0.95. If you are searching for anything else,
> the selectivity estimate ought to be 0.05 or less. If you don't know
> what value you will be searching for, which number should you use?

You are going to love this:

    0.95 * 0.95 + 0.05 * 0.05

This is because with 95% of one value, you would think they ask for
that value 95% of the time, and another value 5% of the time. The last
0.05 is not really accurate. It assumes there are only two unique
values in the table, which may be wrong, but it is close enough.

>
> The unsupported assumption here is that if the table contains 95%
> occurrence of a particular value, then the odds are also 95% (or at
> least high) that that's the value you are searching for in any given
> query that has an "= something" WHERE qual.

Yes.

> That assumption is pretty reasonable in some cases (such as your
> example earlier of "WHERE state = 'PA'" in a Pennsylvania-local
> database), but it falls down badly in others, such as where the
> most common value is NULL or an empty string or some other indication
> that there's no useful data. In that sort of situation it's actually
> pretty unlikely that the user will be searching for field =
> most-common-value ... but the system probably has no way to know that.

Well, if null is most common, it is very probable they would be looking
for col IS NULL.
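As a rough illustration of the arithmetic above, here is a minimal Python sketch. The function name `expected_eq_selectivity` is hypothetical and not part of PostgreSQL. It makes the same two assumptions as the message: the searched-for value is the most frequent one with probability equal to its share of the table, and every other row holds a single second value.

```python
def expected_eq_selectivity(mfov_fraction: float) -> float:
    """Expected selectivity of "= X" when X is not known in advance.

    Assumption: X equals the most frequent value with probability
    mfov_fraction, and all remaining rows share one other value
    (the two-value simplification described in the message).
    """
    if not 0.0 <= mfov_fraction <= 1.0:
        raise ValueError("mfov_fraction must be between 0 and 1")
    rest = 1.0 - mfov_fraction
    # Weight each case's selectivity by how likely that case is searched for.
    return mfov_fraction * mfov_fraction + rest * rest


print(f"{expected_eq_selectivity(0.95):.4f}")  # prints 0.9050
```

Under these assumptions the 95% example works out to 0.905.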
> I wonder whether it would help to add even more data to pg_statistic.
> For example, suppose we store the fraction of the columns that are NULL,
> plus the most frequently occurring *non null* value, plus the fraction
> of the columns that are that value. This would allow us to be very
> smart about columns in which "no data" is represented by NULL (as a good
> DB designer would do):

That would be nice.

>
> selectivity of "IS NULL": NULLfraction
>
> selectivity of "IS NOT NULL": 1 - NULLfraction
>
> selectivity of "= X" for a known non-null constant X:
>     if X == MFOV: MFOVfraction
>     else: MIN(MFOVfraction, 1-MFOVfraction-NULLfraction)
>
> selectivity of "= X" when X is not known a priori, but presumably is not
> null:
>     MIN(MFOVfraction, 1-NULLfraction)
>
> Both of the MIN()s are upper bounds, so multiplying them by a
> fudge-factor < 1 would be reasonable.

Yes, I am with you here.

> These rules would guarantee small selectivity values when either
> MFOVfraction or 1-NULLfraction is small. It still wouldn't cost
> much, since I believe VACUUM ANALYZE is counting nulls already...

Yes, it is. Sounds nice.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
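The selectivity rules quoted above can be sketched in Python roughly as follows. This is only an illustration of the proposal, not the actual `eqsel` code: the `ColumnStats` class, the function names, and the `fudge` parameter are all hypothetical, and `mfov_fraction` is taken to be a fraction of all rows, as the `1-MFOVfraction-NULLfraction` term implies.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class ColumnStats:
    """Hypothetical per-column statistics, as proposed in the message."""
    null_fraction: float  # fraction of rows where the column is NULL
    mfov: Any             # most frequently occurring non-null value
    mfov_fraction: float  # fraction of all rows equal to mfov


def sel_is_null(s: ColumnStats) -> float:
    return s.null_fraction


def sel_is_not_null(s: ColumnStats) -> float:
    return 1.0 - s.null_fraction


def sel_eq_known(s: ColumnStats, x: Any, fudge: float = 1.0) -> float:
    """Selectivity of "= X" for a known non-null constant X."""
    if x == s.mfov:
        return s.mfov_fraction
    # Upper bound for any other value, optionally scaled by a fudge factor < 1.
    return fudge * min(s.mfov_fraction,
                       1.0 - s.mfov_fraction - s.null_fraction)


def sel_eq_unknown(s: ColumnStats, fudge: float = 1.0) -> float:
    """Selectivity of "= X" when X is unknown but presumed non-null."""
    return fudge * min(s.mfov_fraction, 1.0 - s.null_fraction)


# Example: a column that is 90% NULL, with 'PA' in 6% of all rows.
stats = ColumnStats(null_fraction=0.90, mfov="PA", mfov_fraction=0.06)
print(sel_is_null(stats), sel_eq_known(stats, "PA"))
```

With a mostly-NULL column like this one, every "= X" estimate stays small, which is the behavior the proposal aims for.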