Re: [HACKERS] Make ANALYZE more selective about what is a "most common value"?

From: Mark Kirkwood
Subject: Re: [HACKERS] Make ANALYZE more selective about what is a "most common value"?
Date:
Msg-id: f6ced95d-54d0-9b01-1588-f768a2b2a097@catalyst.net.nz
In reply to: [HACKERS] Make ANALYZE more selective about what is a "most common value"?  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: [HACKERS] Make ANALYZE more selective about what is a "most common value"?  (Gavin Flower <GavinFlower@archidevsys.co.nz>)
List: pgsql-hackers
On 05/06/17 09:30, Tom Lane wrote:

> I've been thinking about the behavior discussed in
> https://www.postgresql.org/message-id/flat/20170522132017.29944.48391%40wrigleys.postgresql.org
> and it seems to me that there are a couple of things we ought to do about
> it.
>
> First, I think we need a larger hard floor on the number of occurrences
> of a value that're required to make ANALYZE decide it is a "most common
> value".  The existing coding is willing to believe that anything that
> appears at least twice in the sample is a potential MCV, but that design
> originated when we were envisioning stats samples of just a few thousand
> rows --- specifically, default_statistics_target was originally just 10,
> leading to a 3000-row sample size.  So accepting two-appearance values as
> MCVs would lead to a minimum MCV frequency estimate of 1/1500.  Now it
> could be a tenth or a hundredth of that.
>
> As a round number, I'm thinking that a good floor would be a frequency
> estimate of 1/1000.  With today's typical sample size of 30000 rows,
> a value would have to appear at least 30 times in the sample to be
> believed to be an MCV.  That seems like it gives us a reasonable margin
> of error against the kind of sampling noise seen in the above-cited
> thread.
>
> Second, the code also has a rule that potential MCVs need to have an
> estimated frequency at least 25% larger than what it thinks the "average"
> value's frequency is.  A rule of that general form seems like a good idea,
> but I now think the 25% threshold is far too small to do anything useful.
> In particular, in any case like this where there are more distinct values
> than there are sample rows, the "average frequency" estimate will
> correspond to less than one occurrence in the sample, so that this rule is
> totally useless to filter anything that we would otherwise consider as an
> MCV.  I wonder if we shouldn't make it be "at least double the estimated
> average frequency".
>

Or possibly calculate the sample standard deviation and make use of that 
to help decide on a more flexible cutoff than twice the avg frequency?
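
Very roughly, I'm imagining something like the toy sketch below (standalone 
C, purely to illustrate the idea; none of these names, nor the hard-wired 
1/1000 floor, come from the actual analyze.c code): take the per-value 
counts observed in the sample, compute their mean and sample standard 
deviation, and only admit a value as an MCV if its count clears both the 
proposed hard floor and, say, mean plus two standard deviations.

/*
 * Toy illustration only: "counts" holds the number of times each candidate
 * value appeared in the sample.  Return the minimum count a value must
 * reach before we would call it an MCV: the larger of the proposed hard
 * floor (samplerows / 1000) and mean + 2 * sample standard deviation.
 */
#include <math.h>
#include <stdio.h>

static double
mcv_count_cutoff(const int *counts, int ncounts, int samplerows)
{
	double	sum = 0.0;
	double	sumsq = 0.0;
	double	mean;
	double	stddev;
	int		i;

	if (ncounts < 2)
		return (double) samplerows;		/* degenerate: accept nothing */

	for (i = 0; i < ncounts; i++)
		sum += counts[i];
	mean = sum / ncounts;

	for (i = 0; i < ncounts; i++)
		sumsq += (counts[i] - mean) * (counts[i] - mean);
	stddev = sqrt(sumsq / (ncounts - 1));	/* sample standard deviation */

	return fmax(samplerows / 1000.0, mean + 2.0 * stddev);
}

int
main(void)
{
	/* e.g. per-value counts from a 30000-row sample, one clear outlier */
	int		counts[] = {2, 2, 3, 2, 4, 2, 2, 95, 3, 2};

	printf("count cutoff = %.1f\n", mcv_count_cutoff(counts, 10, 30000));
	return 0;
}

The attraction is that the cutoff then adapts to how spread out the counts 
actually are, rather than being a fixed multiple of the average; whether 
two standard deviations is the right multiplier is exactly the sort of 
question I'd want the statisticians to weigh in on.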

Are there any research papers that might help us here (I'm drowning in a 
sea of barely relevant search results for most phrases I've tried so 
far)? I recall there were some that Tom referenced when this stuff was 
originally written.

On the other hand, I do have access to some mathematicians specializing 
in statistics, so I can get their thoughts on this issue if you feel it 
would be worthwhile.

Cheers

Mark


