Re: Choosing values for multivariate MCV lists

Поиск

Список

Период

Сортировка

От	Tomas Vondra
Тема	Re: Choosing values for multivariate MCV lists
Дата	22 июня 2019 г. 14:10:52
Msg-id	20190622141052.5csokbp3sx577mub@development обсуждение
Ответ на	Re: Choosing values for multivariate MCV lists (Dean Rasheed <dean.a.rasheed@gmail.com>)
Ответы	Re: Choosing values for multivariate MCV lists Re: Choosing values for multivariate MCV lists
Список	pgsql-hackers

Дерево обсуждения

On Fri, Jun 21, 2019 at 08:50:33AM +0100, Dean Rasheed wrote:
>On Thu, 20 Jun 2019 at 23:35, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>
>> On Thu, Jun 20, 2019 at 06:55:41AM +0100, Dean Rasheed wrote:
>>
>> >I'm not sure it's easy to justify ordering by Abs(freq-base_freq)/freq
>> >though, because that would seem likely to put too much weight on the
>> >least commonly occurring values.
>>
>> But would that be an issue, or a good thing? I mean, as long as the item
>> is above mincount, we take the counts as reliable. As I explained, my
>> motivation for proposing that was that both
>>
>>    ... (cost=... rows=1 ...) (actual=... rows=1000001 ...)
>>
>> and
>>
>>    ... (cost=... rows=1000000 ...) (actual=... rows=2000000 ...)
>>
>> have exactly the same Abs(freq - base_freq), but I think we both agree
>> that the first misestimate is much more dangerous, because it's off by six
>> orders of magnitude.
>>
>
>Hmm, that's a good example. That definitely suggests that we should be
>trying to minimise the relative error, but also perhaps that what we
>should be looking at is actually just the ratio freq / base_freq,
>rather than their difference.
>

Attached are patches that implement this (well, the first one optimizes
how the base frequency is computed, which helps for high statistic target
values). The 0002 part picks the items based on

   Max(freq/base_freq, base_freq/freq)

It did help with the addresses data set quite a bit, but I'm sure it needs
more testing. I've also tried using

   freq * abs(freq - base_freq)

but the estimates were not as good.

One annoying thing I noticed is that the base_frequency tends to end up
being 0, most likely due to getting too small. It's a bit strange, though,
because with statistic target set to 10k the smallest frequency for a
single column is 1/3e6, so for 2 columns it'd be ~1/9e12 (which I think is
something the float8 can represent).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Вложения

В списке pgsql-hackers по дате отправления:

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: Choosing values for multivariate MCV lists

Вложения