Re: [HACKERS] Make ANALYZE more selective about what is a "most common value"?

From: Dean Rasheed
Subject: Re: [HACKERS] Make ANALYZE more selective about what is a "most common value"?
Msg-id: CAEZATCV1oE7MW+yH79-=A74DX00ZJMwUq4ke2FN0d-fzAzxfMQ@mail.gmail.com
In response to: Re: [HACKERS] Make ANALYZE more selective about what is a "most common value"?  (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers

On 11 June 2017 at 20:19, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The standard way of doing this is to calculate the "standard error" of
>> the sample proportion - see, for example [3], [4]:
>>   SE = sqrt(p*(1-p)/n)
>> Note, however, that this formula assumes that the sample size n is
>> small compared to the population size N, which is not necessarily the
>> case. This can be taken into account by applying the "finite
>> population correction" (see, for example [5]), which involves
>> multiplying by an additional factor:
>>   SE = sqrt(p*(1-p)/n) * sqrt((N-n)/(N-1))
>
> It's been a long time since college statistics, but that wikipedia article
> reminds me that the binomial distribution isn't really the right thing for
> our problem anyway.  We're doing sampling without replacement, so that the
> correct model is the hypergeometric distribution.

Yes, that's right.

>  The article points out
> that the binomial distribution is a good approximation as long as n << N.
> Can this FPC factor be justified as converting binomial estimates into
> hypergeometric ones, or is it ad hoc?

No, it's not just ad hoc. It is the square root of the variance of the
hypergeometric distribution [1] divided by the variance of a binomial
distribution [2] with p=K/N, in the notation of those articles.
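
To spell that ratio out, here is a short derivation sketch. It uses the
variance formulas as given in [1] and [2], with N the population size, K the
number of matching items in the population and n the sample size. The symbol
X (the number of matching items found in the sample) is introduced here only
for illustration.

```latex
\begin{align*}
\operatorname{Var}_{\mathrm{hyper}}(X)
  &= n \,\frac{K}{N}\left(1 - \frac{K}{N}\right)\frac{N-n}{N-1} \\
\operatorname{Var}_{\mathrm{binom}}(X)
  &= n\,p\,(1-p), \qquad p = \frac{K}{N} \\
\frac{\operatorname{Var}_{\mathrm{hyper}}(X)}{\operatorname{Var}_{\mathrm{binom}}(X)}
  &= \frac{N-n}{N-1} \\
\mathrm{SE}\!\left(\frac{X}{n}\right)
  &= \frac{\sqrt{\operatorname{Var}_{\mathrm{hyper}}(X)}}{n}
   = \sqrt{\frac{p\,(1-p)}{n}}\;\sqrt{\frac{N-n}{N-1}}
\end{align*}
```

The last line is the binomial standard error multiplied by the FPC factor,
so the corrected formula is the hypergeometric result rather than an
approximation to it. When n << N the factor is close to 1, which recovers the
binomial approximation.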

This formula is actually very widely used, for example in the analysis
of survey data, which is inherently sampling without replacement
(assuming the questioners don't survey the same people more than
once!).
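
For a concrete sense of scale, here is a minimal Python sketch of a
survey-style case. The numbers (a population of 1000 with 200 matching items
and a sample of 400) and all function names are made up for illustration;
none of this is code from ANALYZE. It compares the uncorrected and corrected
standard errors with the value obtained by summing the hypergeometric
probabilities directly.

```python
import math
from fractions import Fraction


def se_binomial(p, n):
    """Standard error of a sample proportion, ignoring the population size."""
    return math.sqrt(p * (1 - p) / n)


def fpc(n, N):
    """Finite population correction factor sqrt((N-n)/(N-1))."""
    return math.sqrt((N - n) / (N - 1))


def se_corrected(p, n, N):
    """Binomial standard error multiplied by the FPC factor."""
    return se_binomial(p, n) * fpc(n, N)


def exact_sd_of_proportion(K, n, N):
    """Standard deviation of X/n, where X is the number of matching items
    in a sample of n drawn without replacement from N items, K of which
    match. Computed by summing the hypergeometric pmf with exact fractions.
    """
    total = math.comb(N, n)
    lo, hi = max(0, n - (N - K)), min(n, K)
    pmf = {
        k: Fraction(math.comb(K, k) * math.comb(N - K, n - k), total)
        for k in range(lo, hi + 1)
    }
    mean = sum(k * w for k, w in pmf.items())
    var = sum((k - mean) ** 2 * w for k, w in pmf.items())
    return math.sqrt(var) / n


# Hypothetical example values, chosen so that n is not small compared to N.
N, K, n = 1000, 200, 400
p = K / N

print("uncorrected SE:", se_binomial(p, n))
print("corrected SE:  ", se_corrected(p, n, N))
print("exact value:   ", exact_sd_of_proportion(K, n, N))
```

With the sample at 40% of the population, the uncorrected formula visibly
overstates the error, while the corrected one agrees with the exact
hypergeometric value up to floating-point rounding.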

Regards,
Dean


[1] https://en.wikipedia.org/wiki/Hypergeometric_distribution
[2] https://en.wikipedia.org/wiki/Binomial_distribution


