Re: [HACKERS] Make ANALYZE more selective about what is a "most common value"?

From: Mark Kirkwood
Subject: Re: [HACKERS] Make ANALYZE more selective about what is a "most common value"?
Date:
Msg-id: f6ced95d-54d0-9b01-1588-f768a2b2a097@catalyst.net.nz
In reply to: [HACKERS] Make ANALYZE more selective about what is a "most common value"?  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: [HACKERS] Make ANALYZE more selective about what is a "most common value"?  (Gavin Flower <GavinFlower@archidevsys.co.nz>)
List: pgsql-hackers
On 05/06/17 09:30, Tom Lane wrote:

> I've been thinking about the behavior discussed in
> https://www.postgresql.org/message-id/flat/20170522132017.29944.48391%40wrigleys.postgresql.org
> and it seems to me that there are a couple of things we ought to do about
> it.
>
> First, I think we need a larger hard floor on the number of occurrences
> of a value that're required to make ANALYZE decide it is a "most common
> value".  The existing coding is willing to believe that anything that
> appears at least twice in the sample is a potential MCV, but that design
> originated when we were envisioning stats samples of just a few thousand
> rows --- specifically, default_statistics_target was originally just 10,
> leading to a 3000-row sample size.  So accepting two-appearance values as
> MCVs would lead to a minimum MCV frequency estimate of 1/1500.  Now it
> could be a tenth or a hundredth of that.
>
> As a round number, I'm thinking that a good floor would be a frequency
> estimate of 1/1000.  With today's typical sample size of 30000 rows,
> a value would have to appear at least 30 times in the sample to be
> believed to be an MCV.  That seems like it gives us a reasonable margin
> of error against the kind of sampling noise seen in the above-cited
> thread.
>
> Second, the code also has a rule that potential MCVs need to have an
> estimated frequency at least 25% larger than what it thinks the "average"
> value's frequency is.  A rule of that general form seems like a good idea,
> but I now think the 25% threshold is far too small to do anything useful.
> In particular, in any case like this where there are more distinct values
> than there are sample rows, the "average frequency" estimate will
> correspond to less than one occurrence in the sample, so that this rule is
> totally useless to filter anything that we would otherwise consider as an
> MCV.  I wonder if we shouldn't make it be "at least double the estimated
> average frequency".
>

Or possibly calculate the sample standard deviation and make use of that 
to help decide on a more flexible cutoff than twice the avg frequency?
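
Very roughly, I'm imagining something like the toy sketch below (standalone 
C, purely to illustrate the idea; none of these names, nor the hard-wired 
1/1000 floor, come from the actual analyze.c code): take the per-value 
counts observed in the sample, compute their mean and sample standard 
deviation, and only admit a value as an MCV if its count clears both the 
proposed hard floor and, say, mean plus two standard deviations.

/*
 * Toy illustration only: "counts" holds the number of times each candidate
 * value appeared in the sample.  Return the minimum count a value must
 * reach before we would call it an MCV: the larger of the proposed hard
 * floor (samplerows / 1000) and mean + 2 * sample standard deviation.
 */
#include <math.h>
#include <stdio.h>

static double
mcv_count_cutoff(const int *counts, int ncounts, int samplerows)
{
	double	sum = 0.0;
	double	sumsq = 0.0;
	double	mean;
	double	stddev;
	int		i;

	if (ncounts < 2)
		return (double) samplerows;		/* degenerate: accept nothing */

	for (i = 0; i < ncounts; i++)
		sum += counts[i];
	mean = sum / ncounts;

	for (i = 0; i < ncounts; i++)
		sumsq += (counts[i] - mean) * (counts[i] - mean);
	stddev = sqrt(sumsq / (ncounts - 1));	/* sample standard deviation */

	return fmax(samplerows / 1000.0, mean + 2.0 * stddev);
}

int
main(void)
{
	/* e.g. per-value counts from a 30000-row sample, one clear outlier */
	int		counts[] = {2, 2, 3, 2, 4, 2, 2, 95, 3, 2};

	printf("count cutoff = %.1f\n", mcv_count_cutoff(counts, 10, 30000));
	return 0;
}

The attraction is that the cutoff then adapts to how spread out the counts 
actually are, rather than being a fixed multiple of the average; whether 
two standard deviations is the right multiplier is exactly the sort of 
question I'd want the statisticians to weigh in on.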

Are there any research papers that might help us here (I'm drowning in a 
sea of barely relevant search results for most phrases I've tried so 
far)? I recall there were some that Tom referenced when this stuff was 
originally written.

On the other hand, I do have access to some mathematicians specializing 
in statistics, so I can get their thoughts on this issue if you feel it 
would be worthwhile.

Cheers

Mark


