Re: [HACKERS] PATCH: multivariate histograms and MCV lists

Поиск

Список

Период

Сортировка

От	Dean Rasheed
Тема	Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Дата	10 января 2019 г. 18:20:51
Msg-id	CAEZATCVEU=ZZ31yqQY=nUg+p7dBLxqJg6iDN-9MoE81uYs-8dw@mail.gmail.com обсуждение исходный текст
Ответ на	Re: [HACKERS] PATCH: multivariate histograms and MCV lists (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Ответы	Re: [HACKERS] PATCH: multivariate histograms and MCV lists (Tomas Vondra <tomas.vondra@2ndquadrant.com>) Re: [HACKERS] PATCH: multivariate histograms and MCV lists (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Список	pgsql-hackers

Дерево обсуждения

On Wed, 9 Jan 2019 at 15:40, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> On 1/8/19 3:18 PM, Dean Rasheed wrote:
> > So actually, the estimate for a group of values will be either the MCV
> > item's frequency (if the MCV item is kept), or (roughly) the MCV
> > item's base_frequency (if the MCV item is not kept). That suggests
> > that we should simply keep items that are significantly more or less
> > common than the item's base frequency -- i.e., keep rule (b) and ditch
> > rule (a).
> >
>
> Hmmm, but won't that interfere with how we with how we extrapolate the
> MCV estimate to the non-MCV part? Currently the patch does what you
> proposed, i.e.
>
>     other_sel = simple_sel - mcv_basesel;
>
> I'm worried that if we only include the items that are significantly
> more or less common than the base frequency, it may skew the other_sel
> estimate.
>

I don't see how that would skew other_sel. Items close to the base
frequency would also tend to be close to simple_sel, making other_sel
approximately zero, so excluding them should have little effect.
However...

Re-reading the thread where we enhanced the per-column MCV stats last
year [1], it was actually the case that an algorithm based on just
looking at the relative standard error worked pretty well for a very
wide range of data distributions.

The final algorithm chosen in analyze_mcv_list() was only a marginal
improvement on that, and was directly based upon the fact that, in the
univariate statistics case, all the values not included in the MCV
list are assigned the same selectivity. However, that's not the case
for multivariate stats, because each group not included in the
multivariate MCV list gets assigned a different selectivity based on
its per-column stats.

So perhaps what we should do for multivariate stats is simply use the
relative standard error approach (i.e., reuse the patch in [2] with a
20% RSE cutoff). That had a lot of testing at the time, against a wide
range of data distributions, and proved to be very good, not to
mention being very simple.

That approach would encompass both groups more and less common than
the base frequency, because it relies entirely on the group appearing
enough times in the sample to infer that any errors on the resulting
estimates will be reasonably well controlled. It wouldn't actually
look at the base frequency at all in deciding which items to keep.

Moreover, if the group appears sufficiently often in the sample to
justify being kept, each of the individual column values must also
appear at least that often as well, which means that the errors on the
base frequency estimate are also well controlled. That was one of my
concerns about other algorithms such as "keep items significantly more
or less common than the base frequency" -- in the less common case,
there's no lower bound on the number of occurrences seen, and so no
guarantee that the errors are kept under control.

Regards,
Dean

[1]
https://www.postgresql.org/message-id/flat/CAMkU%3D1yvdGvW9TmiLAhz2erFnvnPFYHbOZuO%2Ba%3D4DVkzpuQ2tw%40mail.gmail.com

[2] https://www.postgresql.org/message-id/CAEZATCUEmHCZeOHJN8JO5O9LK_VuFeCecy_AxTk7S_2SmLXeyw%40mail.gmail.com

В списке pgsql-hackers по дате отправления:

Предыдущее

От: Pavel Stehule
Дата: 10 января 2019 г., 18:15:18
Сообщение: Re: PostgreSQL vs SQL/XML Standards

Следующее

От: Tomas Vondra
Дата: 10 января 2019 г., 18:45:23
Сообщение: Re: [HACKERS] PATCH: multivariate histograms and MCV lists

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: [HACKERS] PATCH: multivariate histograms and MCV lists

Предыдущее

Следующее