Re: Choosing values for multivariate MCV lists

From: Dean Rasheed
Subject: Re: Choosing values for multivariate MCV lists
Msg-id: CAEZATCWMDpQUmnSXtRCq_CxLXUoLETra=7SoQ3P5ydiFcse7wQ@mail.gmail.com
In reply to: Choosing values for multivariate MCV lists  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses: Re: Choosing values for multivariate MCV lists  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List: pgsql-hackers
On Tue, 18 Jun 2019 at 21:59, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> The current implementation of multi-column MCV lists (added in this
> cycle) uses a fairly simple algorithm to pick combinations to include in
> the MCV list. We just compute a minimum number of occurrences, and then
> include all entries sampled more often. See get_mincount_for_mcv_list().
>
> [snip]
>
> This however means that by choosing the MCV entries solely based on the
> number of occurrences in the sample, we end up with MCV lists where the
> vast majority of entries have a NULL street name.
>
> That's why we got such a poor estimate in the example query, despite the
> fact that the city/street combination is the most frequent in the data
> set (with non-NULL street name).
>

I think the fact that they're NULL is a bit of a red herring because
we're treating NULL just like any other value. The same thing would
happen if there were some other very common non-NULL value that
dominated the dataset.

> The other weird thing is that the frequency of NULL street names is fairly
> uniform in the whole data set. In total, about 50% of addresses match that,
> and for individual cities it's generally between 25% and 100%, so the
> estimate is less than 2x off in those cases.
>
> But for addresses with non-NULL street names, the situation is very
> different. Some street names are unique to a single city, etc.
>
> Overall, this means we end up with an MCV list with entries representing
> the mostly-uniform part of the data set, instead of preferring the
> entries that are truly skewed.
>
> So I'm wondering if we could/should rethink the algorithm to look at
> the frequency and base_frequency, and maybe pick entries based on their
> ratio, or something like that.
>

Hmm, interesting. I think I would like to see a more rigorous
justification for changing the algorithm deciding which values to
keep.

If I've understood correctly, I think the problem is this: The
mincount calculation is a good way of identifying MCV candidates to
keep, because it ensures that we don't keep values that don't appear
sufficiently often to produce accurate estimates, and ideally we'd
keep everything with count >= mincount. However, in the case where
there are more than stats_target items with count >= mincount, simply
ordering by count and keeping the most commonly seen values isn't
necessarily the best strategy for multivariate statistics.

To identify what the best strategy might be, I think you need to
examine the errors that would occur as a result of *not* keeping a
value in the multivariate MCV list. Given a value that appears with
count >= mincount, N*freq ought to be a reasonable estimate for the
actual number of occurrences of that value in the table, and
N*base_freq ought to be a reasonable estimate for the
univariate-stats-based estimate that it would be given if it weren't
kept in the multivariate MCV list. So the absolute error resulting
from not keeping that value would be

  N * Abs(freq - base_freq)

But then I think we ought to take into account how often we're likely
to get that error. If we're simply picking values at random, the
likelihood of getting that value is just its frequency, so the average
absolute error would be

  Sum( N * freq[i] * Abs(freq[i] - base_freq[i]) )

which suggests that, if we wanted to reduce the average absolute error
of the estimates, we should order by freq*Abs(freq-base_freq) and keep
the top n in the MCV list.

On the other hand, if we wanted to reduce the average *relative* error
of the estimates, we might instead order by Abs(freq-base_freq).

> For example, we might sort the entries by
>
>     abs(freq - base_freq) / freq
>

I'm not sure it's easy to justify ordering by Abs(freq-base_freq)/freq
though, because that would seem likely to put too much weight on the
least commonly occurring values.
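
To make that concrete, here's a throwaway sketch in C (the numbers are
entirely made up, only loosely inspired by the street/city example)
that just prints the three candidate ranking keys for a few items:

/*
 * Hypothetical illustration of the ranking keys discussed above:
 *
 *   freq * |freq - base_freq|   -- minimise average absolute error
 *   |freq - base_freq|          -- minimise average relative error
 *   |freq - base_freq| / freq   -- the ratio suggested above
 */
#include <math.h>
#include <stdio.h>

typedef struct
{
	const char *label;
	double		freq;		/* sampled frequency of the combination */
	double		base_freq;	/* product of per-column frequencies */
} Item;

int
main(void)
{
	Item		items[] = {
		{"common group, freq ~ base_freq", 0.02, 0.019},
		{"correlated city/street pair", 0.0002, 0.0000002},
		{"very rare, near-deterministic pair", 0.00005, 0.00000001},
	};

	for (int i = 0; i < 3; i++)
	{
		double		diff = fabs(items[i].freq - items[i].base_freq);

		printf("%-36s  f*|d|=%.1e  |d|=%.1e  |d|/f=%.3f\n",
			   items[i].label, items[i].freq * diff, diff,
			   diff / items[i].freq);
	}

	return 0;
}

With these made-up numbers, Abs(freq-base_freq)/freq ranks the very
rare pair almost level with the far more common correlated pair, which
is the over-weighting I mean, whereas the choice between
freq*Abs(freq-base_freq) and Abs(freq-base_freq) is essentially the
absolute-vs-relative error trade-off above.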

> Of course, this is a challenging problem, for a couple of reasons.
>
> Firstly, simply picking the most frequent groups is very simple, and it
> gives us additional information about the largest group (which may be
> useful elsewhere, e.g. the incremental sort patch).
>

Yes, you would have to keep in mind that changing the algorithm would
mean that the MCV list no longer represented all the most common
values. For example, it would no longer be valid to assume that no
value appeared more often than the first value in the MCV list. I'm
not sure that we currently do that though.

> Secondly, if the correlated combinations are less frequent, how reliably
> can we even estimate the frequency from a sample? The combination in the
> example query was ~0.02% of the data, so how likely is it to be sampled?
>

I think that's OK as long as we keep the mincount filter in the new
algorithm. I think some experimentation is definitely worthwhile here,
but it looks plausible that a decent approach might be:

1). Discard all values with count < mincount.
2). Order by freq*Abs(freq-base_freq) (or possibly just
Abs(freq-base_freq)) and keep the top n, where n is the stats target.
3). Perhaps re-sort by count, so the final list is in frequency order
again? Not sure if that's a desirable property to keep.
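
In code, the kind of thing I have in mind is roughly the sketch below.
It's against a hypothetical flat array of candidates rather than the
actual structures in mcv.c (McvCandidate, choose_mcv_items etc. are
made-up names), and it assumes the candidates arrive sorted by count,
descending:

#include <math.h>
#include <stdlib.h>

typedef struct
{
	double		freq;		/* sampled frequency of the item */
	double		base_freq;	/* product of the per-column frequencies */
	int			count;		/* number of occurrences in the sample */
} McvCandidate;

/* sort descending by freq * |freq - base_freq| */
static int
cmp_by_score(const void *a, const void *b)
{
	const McvCandidate *ca = (const McvCandidate *) a;
	const McvCandidate *cb = (const McvCandidate *) b;
	double		sa = ca->freq * fabs(ca->freq - ca->base_freq);
	double		sb = cb->freq * fabs(cb->freq - cb->base_freq);

	if (sa < sb)
		return 1;
	if (sa > sb)
		return -1;
	return 0;
}

/* sort descending by count, to restore frequency order at the end */
static int
cmp_by_count(const void *a, const void *b)
{
	return ((const McvCandidate *) b)->count -
		((const McvCandidate *) a)->count;
}

/*
 * Reduce the candidate array (sorted by count, descending) to at most
 * stats_target entries.  Returns the number of items kept.
 */
static int
choose_mcv_items(McvCandidate *items, int nitems,
				 int stats_target, double mincount)
{
	int			nkeep = nitems;

	/* 1) discard everything sampled fewer than mincount times */
	while (nkeep > 0 && items[nkeep - 1].count < mincount)
		nkeep--;

	/* 2) if still too many, rank by freq * |freq - base_freq| */
	if (nkeep > stats_target)
	{
		qsort(items, nkeep, sizeof(McvCandidate), cmp_by_score);
		nkeep = stats_target;

		/* 3) re-sort the kept items, so the list is in frequency order */
		qsort(items, nkeep, sizeof(McvCandidate), cmp_by_count);
	}

	return nkeep;
}

Whether step 3 is worth keeping really depends on whether anything
relies on the list being in frequency order, per your point about the
largest group above.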

Regards,
Dean


