Re: [HACKERS] Index Puzzle for you

Поиск
Список
Период
Сортировка
От Tom Lane
Тема Re: [HACKERS] Index Puzzle for you
Дата
Msg-id 3841.946480843@sss.pgh.pa.us
обсуждение исходный текст
Ответ на Re: [HACKERS] Index Puzzle for you  (Kristofer Munn <kmunn@munn.com>)
Список pgsql-hackers
Kristofer Munn <kmunn@munn.com> writes:
> Good catch!  There were 296 possible issues the table.  One had 86,544
> articles associated with it.  The next highest was 5,949.  Then the
> numbers drop to 630, 506, 412, 184 and then the rest are all under 62.
> Out of curiosity, how does vacuum decide on the large estimate?

The estimate is made using a "disbursion" statistic calculated by VACUUM
ANALYZE.  I don't know the whole statistical theory here, but if you
think of disbursion as the fraction of values in the column that are
equal to the most common value, you won't be too far off.

I gather from your numbers that your table has about 1 million rows,
so the disbursion of ixissue would be somewhere around 86544/1000000.

The planner uses the disbursion in a way that amounts to assuming that
any "WHERE column = constant" search is in fact searching for the most
common value, so we get an estimate of returned rows that is in the
vicinity of the number of rows with the most common value.  (It's not
exact, first because VACUUM can't estimate that number perfectly
accurately, and second because the disbursion actually has some second-
order terms in it too.)

When the most common value is much more common than anything else,
this essentially means that queries are always optimized for retrieving
the most common value, even when they're retrieving some other value.
In your particular case, the optimizer is estimating that the runtime
of an index scan that needs to retrieve almost 10% of the rows in the
table will be worse than the runtime of a plain sequential scan.  I'm
not sure if that's right or not (the cost models could use more work),
but the first-order mistake is that the estimate of retrieved rows is
way off --- unless you are actually retrieving that one hugely popular
issue.

In current sources (7.0-to-be), VACUUM records the most common value
along with the disbursion, and the planner checks to see if the
"constant" in the WHERE clause is that value or not.  If not, it doesn't
use the disbursion straight-up, but a smaller estimate.  This helps a
good deal on drastically skewed column distributions such as you are
describing.  It's still easily fooled :-(, but it's hard to see how to
do much better without expending a lot more space to store a lot more
statistics.
        regards, tom lane


В списке pgsql-hackers по дате отправления:

Предыдущее
От: Adriaan Joubert
Дата:
Сообщение: Re: [HACKERS] Index corruption
Следующее
От: Adriaan Joubert
Дата:
Сообщение: Re: [HACKERS] Index corruption