Re: Selectivity of "=" (Re: [HACKERS] Index not used on simple se lect)

Поиск
Список
Период
Сортировка
От Tom Lane
Тема Re: Selectivity of "=" (Re: [HACKERS] Index not used on simple se lect)
Дата
Msg-id 25713.933205461@sss.pgh.pa.us
обсуждение исходный текст
Ответ на Re: Selectivity of "=" (Re: [HACKERS] Index not used on simple se lect)  (Bruce Momjian <maillist@candle.pha.pa.us>)
Ответы Re: Selectivity of "=" (Re: [HACKERS] Index not used on simple se lect)  (Bruce Momjian <maillist@candle.pha.pa.us>)
Список pgsql-hackers
Bruce Momjian <maillist@candle.pha.pa.us> writes:
>> BTW, this argument proves rigorously that the selectivity of a search
>> for any value other than the MFOV is not more than 0.5, so there is some
>> basis for my intuition that eqsel should not return a value above 0.5.
>> So, in the cases where eqsel does not know the exact value being
>> searched for, I'd still be inclined to cap its result at 0.5.

> I don't follow this. If the most frequent value occurs 95% of the time,
> wouldn't the selectivity be 0.95?

If you are searching for the most frequent value, then the selectivity
estimate should indeed be 0.95.  If you are searching for anything else,
the selectivity estimate ought to be 0.05 or less.  If you don't know
what value you will be searching for, which number should you use?

The unsupported assumption here is that if the table contains 95%
occurrence of a particular value, then the odds are also 95% (or at
least high) that that's the value you are searching for in any given
query that has an "= something" WHERE qual.

That assumption is pretty reasonable in some cases (such as your
example earlier of "WHERE state = 'PA'" in a Pennsylvania-local
database), but it falls down badly in others, such as where the
most common value is NULL or an empty string or some other indication
that there's no useful data.  In that sort of situation it's actually
pretty unlikely that the user will be searching for field =
most-common-value ... but the system probably has no way to know that.

I wonder whether it would help to add even more data to pg_statistic.
For example, suppose we store the fraction of the columns that are NULL,
plus the most frequently occurring *non null* value, plus the fraction
of the columns that are that value.  This would allow us to be very
smart about columns in which "no data" is represented by NULL (as a good
DB designer would do):

selectivity of "IS NULL": NULLfraction

selectivity of "IS NOT NULL": 1 - NULLfraction

selectivity of "= X" for a known non-null constant X:if X == MFOV: MFOVfractionelse: MIN(MFOVfraction,
1-MFOVfraction-NULLfraction)

selectivity of "= X" when X is not known a priori, but presumably is not
null:MIN(MFOVfraction, 1-NULLfraction)

Both of the MIN()s are upper bounds, so multiplying them by a
fudge-factor < 1 would be reasonable.

These rules would guarantee small selectivity values when either
MFOVfraction or 1-NULLfraction is small.  It still wouldn't cost
much, since I believe VACUUM ANALYZE is counting nulls already...
        regards, tom lane


В списке pgsql-hackers по дате отправления:

Предыдущее
От: Tom Lane
Дата:
Сообщение: Re: [HACKERS] pg_dump not dumping all tables
Следующее
От: Tom Lane
Дата:
Сообщение: Re: [HACKERS] pg_dump not dumping all tables