Re: serious under-estimation of n_distinct for clustered distributions

From: Stefan Andreatta
Subject: Re: serious under-estimation of n_distinct for clustered distributions
Date:
Msg-id: 50F3B543.8020205@synedra.com
In reply to: Re: serious under-estimation of n_distinct for clustered distributions  (Stefan Andreatta <s.andreatta@synedra.com>)
Responses: Re: serious under-estimation of n_distinct for clustered distributions  (Peter Geoghegan <peter@2ndquadrant.com>)
List: pgsql-performance
A status update on this problem:

1.) Workarounds (setting n_distinct manually) have been tested and are -
as far as workarounds go - OK.

2.) Source of the problem and possible solution:

The source of these troubles is the sampling method employed in
src/backend/commands/analyze.c. Judging from Tom Lane's comment on the
original implementation in 2004, it was never considered perfect. Does
anybody see a chance to improve that part? Should this discussion be
taken elsewhere? Is there any input from my side that could help?
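
For anybody who wants to reproduce the effect outside the database, here
is a minimal, self-contained sketch in Python (not my original test
script; all numbers are invented for illustration). duj1() is the
Haas & Stokes estimator referenced in analyze.c, and the "two-stage"
part is only a crude stand-in for the block sampling plus Vitter
reservoir that ANALYZE really uses - it merely mimics the fact that all
sample rows come from a limited set of pages:

import random
from collections import Counter

def duj1(n, N, sample):
    """Haas & Stokes Duj1 estimator as used in analyze.c:
       n*d / (n - f1 + f1*n/N)
    n = sample size, N = total rows, d = distinct values in the
    sample, f1 = values occurring exactly once in the sample."""
    counts = Counter(sample)
    d = len(counts)
    f1 = sum(1 for c in counts.values() if c == 1)
    return n * d / (n - f1 + f1 * n / N)

# Invented, perfectly clustered table: 10 million rows, each value on 20
# consecutive rows, 50 rows per page, 30,000 sample rows
# (300 * default_statistics_target).
N = 10_000_000
ROWS_PER_VALUE = 20
ROWS_PER_PAGE = 50
SAMPLE_ROWS = 30_000
N_PAGES = N // ROWS_PER_PAGE
value_of = lambda row: row // ROWS_PER_VALUE   # consecutive rows share a value

# a) random rows picked uniformly from the whole table
uniform = [value_of(r) for r in random.sample(range(N), SAMPLE_ROWS)]

# b) crude stand-in for the two-stage method: pick SAMPLE_ROWS random pages,
#    then SAMPLE_ROWS random rows out of those pages only
pages = random.sample(range(N_PAGES), SAMPLE_ROWS)
candidates = [p * ROWS_PER_PAGE + o for p in pages for o in range(ROWS_PER_PAGE)]
two_stage = [value_of(r) for r in random.sample(candidates, SAMPLE_ROWS)]

print("true n_distinct:       ", N // ROWS_PER_VALUE)
print("estimate, uniform rows:", round(duj1(SAMPLE_ROWS, N, uniform)))
print("estimate, two-stage:   ", round(duj1(SAMPLE_ROWS, N, two_stage)))

With these made-up numbers the uniform row sample should estimate close
to the true 500,000 distinct values, while the page-based sample should
come out far lower - the same kind of under-estimation I see on real,
clustered data. A rough sketch of the page-reweighting idea from the
quoted discussion follows further down, after the quote.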


btw: I do find this problem to be very frequent in our databases. And
considering the commonplace conditions leading to it, I would expect
many systems to be affected. But searching the forums and the web I
hardly found any references to it - which amazes me to no end.


Best Regards,
Stefan



On 12/30/2012 07:02 PM, Stefan Andreatta wrote:
> On 12/29/2012 10:57 PM, Peter Geoghegan wrote:
>> On 29 December 2012 20:57, Stefan Andreatta <s.andreatta@synedra.com>
>> wrote:
>>> Now, the 2005 discussion goes into great detail on the advantages and
>>> disadvantages of this algorithm, particularly when using small
>>> sample sizes,
>>> and several alternatives are discussed. I do not know whether
>>> anything has
>>> been changed after that, but I know that the very distinct problem,
>>> which I
>>> will focus on here, still persists.
>>
>> It's a really hard problem to solve satisfactorily. It's a problem
>> that has been studied in much detail. Yes, the algorithm used is still
>> the same. See the comments within src/backend/commands/analyze.c (IBM
>> Research Report RJ 10025 is referenced there).
>
> Thanks a lot for this information! I looked through the code a bit.
> The Haas & Stokes Formula is fine. The problem really lies with the
> two phase random selection procedure:
>
> Starting from line 1039, there is a comment:
>  * As of May 2004 we use a new two-stage method: Stage one selects up
>  * to targrows random blocks (or all blocks, if there aren't so many).
>  * Stage two scans these blocks and uses the Vitter algorithm to create
>  * a random sample of targrows rows (or less, if there are less in the
>  * sample of blocks). The two stages are executed simultaneously: each
>  * block is processed as soon as stage one returns its number and while
>  * the rows are read stage two controls which ones are to be inserted
>  * into the sample.
>  *
>  * Although every row has an equal chance of ending up in the final
>  * sample, this sampling method is not perfect: not every possible
>  * sample has an equal chance of being selected. For large relations
>  * the number of different blocks represented by the sample tends to be
>  * too small. We can live with that for now. Improvements are welcome.
>
>
> Now the problem with clustered data is that the probability of
> sampling a value twice is much higher when the same page is repeatedly
> sampled. As stage one takes a random sample of pages, and stage two
> samples rows from these pages, the probability of visiting the same
> page twice (or more often) is much higher than if random rows were
> selected from the whole table. Hence we get a lot more multiple values
> for clustered data and we end up with the severe under-estimation we
> can see in those cases.
>
> Probabilities do my brain in, as usual, but I tested the procedure for
> my test data with a simple python script. There is absolutely nothing
> wrong with the implementation. It seems to be a purely statistical
> problem.
>
> Not everything may be hopeless though ;-) The problem could
> theoretically be avoided if random rows were selected from the whole
> table. Again, that may not be feasible - the two phase approach was
> probably not implemented for nothing.
>
> Another possible solution would be to avoid much of the resampling
> (not all) in phase two. For that - in theory - every page visited
> would have to get a lower weight, so that revisiting this page is no
> more likely than if rows were selected from the whole table. That does
> not sound easy or elegant to implement. But perhaps there is some
> clever algorithm - unfortunately I do not know.
>
>
>> The general advice here is:
>>
>> 1) Increase default_statistics_target for the column.
>>
>> 2) If that doesn't help, consider using the following DDL:
>>
>> alter table foo alter column bar set ( n_distinct = 5.0);
>
> Yes, I will probably have to live with that for now - I will come back
> to these workarounds with one or two questions.
>
> Thanks again & Regards,
> Stefan
>
>
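
Regarding the page-reweighting idea quoted above ("every page visited
would have to get a lower weight"): one way to make that concrete -
purely as an illustration, using the same invented numbers as in the
sketch further up, and ignoring the question of how ANALYZE would know
the per-page row counts - is to pick a page uniformly but accept it only
with a probability proportional to the rows not yet sampled from it.
Every remaining row then has the same chance of being chosen, i.e. the
result is equivalent to a uniform row sample over the whole table:

import random
from collections import Counter

# Same invented clustered layout as in the earlier sketch:
N, ROWS_PER_VALUE, ROWS_PER_PAGE = 10_000_000, 20, 50
N_PAGES = N // ROWS_PER_PAGE
SAMPLE_ROWS = 30_000
value_of = lambda row: row // ROWS_PER_VALUE    # consecutive rows share a value

def reweighted_sample():
    """Pick a page uniformly, accept it only with probability
    remaining/ROWS_PER_PAGE (rejection sampling), then take one of its
    not-yet-sampled rows.  Every remaining row in the table keeps the same
    selection probability, so this equals uniform row sampling."""
    remaining = [ROWS_PER_PAGE] * N_PAGES       # unsampled rows left per page
    taken = {}                                  # page -> offsets already taken
    sample = []
    while len(sample) < SAMPLE_ROWS:
        page = random.randrange(N_PAGES)
        if random.random() * ROWS_PER_PAGE >= remaining[page]:
            continue                            # revisited pages are down-weighted
        offsets = taken.setdefault(page, set())
        off = random.choice([o for o in range(ROWS_PER_PAGE) if o not in offsets])
        offsets.add(off)
        remaining[page] -= 1
        sample.append(value_of(page * ROWS_PER_PAGE + off))
    return sample

counts = Counter(reweighted_sample())
print("distinct values in sample: ", len(counts))
print("values seen more than once:", sum(1 for c in counts.values() if c > 1))

The duplicate count should come out in line with the uniform row sample
from the first sketch rather than with the inflated two-stage figure.
Whether anything like this could be done efficiently inside ANALYZE -
without knowing exact per-page row counts up front and without giving up
the largely sequential block reads - is of course the hard part.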


