Re: POC: GROUP BY optimization

From Dmitry Dolgov
Subject Re: POC: GROUP BY optimization
Date
Msg-id CA+q6zcVjCVkAhoLD6rjDoPGdfWDjQLD=N2tvfH1yU+Eq+iby9w@mail.gmail.com
In reply to Re: POC: GROUP BY optimization  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses Re: POC: GROUP BY optimization  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List pgsql-hackers
> On Fri, May 3, 2019 at 11:55 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> I don't recall the exact details of the discussion, since most of it
> happened almost a year ago, but the main concern about the original
> costing approach is that it very much assumes uniform distribution.
>
> For example if you have N tuples with M groups (which are not computed
> using estimate_num_groups IIRC, and we can hardly do much better), then
> the patch assumed each groups is ~N/M rows and used that for computing
> the number of comparisons for a particular sequence of ORDER BY clauses.
>
> But that may result in pretty significant errors in cases with a couple
> of very large groups.

Yes, you're right, the current version of the patch assumes a uniform
distribution of values between these M groups. After some thinking I've come to
the idea that maybe it's better not to look for a costing approach that is
acceptable for both uniform and non-uniform distributions, but simply not to
reorder at all if there are any significant deviations from what seems to be
the "best case", namely when values are distributed among groups relatively
uniformly and there are no doubts about how expensive it is to compare them.
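To make the concern about skew concrete, here is a toy sketch of mine (not
code from the patch): when sorting on (c1, c2), a comparison has to fall
through to c2 roughly when two tuples agree on c1, which happens with
probability equal to the sum of squared group frequencies. Under the uniform
assumption that's 1/M, but a single large group inflates it dramatically:

```python
# Toy illustration (my own sketch, not code from the patch) of why the
# uniform N/M assumption can be badly wrong for comparison counts.

def tie_probability(freqs):
    """Probability that two random tuples fall into the same group,
    i.e. that a comparison must continue to the next ORDER BY column."""
    return sum(f * f for f in freqs)

M = 100
uniform = [1.0 / M] * M                      # M equally sized groups
skewed = [0.9] + [0.1 / (M - 1)] * (M - 1)   # one group holds 90% of rows

print(tie_probability(uniform))  # 0.01, i.e. 1/M
print(tie_probability(skewed))   # ~0.81, ~80x more lower-column comparisons
```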

That said, on top of the current implementation it's possible to check:

* the dependency coefficient between columns (if I understand correctly, a non-
  uniform distribution of values between the groups, a.k.a. a few large groups,
  should be visible from extended statistics as a significant dependency)

* that the largest/smallest group in the MCV list doesn't differ too much from
  the expected "average" group

* that the average width and comparison procost are similar across columns

If everything fits (which I hope would be the case most of the time), apply the
reordering; otherwise use whatever order was specified (which leaves room for
manual reordering in relatively rare situations). Does that make sense?
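The gating I have in mind could look roughly like this (all names and
thresholds below are entirely made up, just to show the shape of the check):

```python
# Rough sketch (hypothetical names and thresholds) of the proposed gating:
# reorder GROUP BY columns only when the distribution looks close enough
# to the uniform "best case".

def safe_to_reorder(dependency_degree, max_mcv_freq, ndistinct,
                    widths, procosts,
                    max_dependency=0.5, max_group_skew=2.0, max_spread=1.5):
    # A strong functional dependency hints at a few large groups.
    if dependency_degree > max_dependency:
        return False
    # The largest MCV group shouldn't dwarf the "average" group of 1/ndistinct.
    avg_freq = 1.0 / ndistinct
    if max_mcv_freq > max_group_skew * avg_freq:
        return False
    # Widths and comparison procosts should be similar across columns,
    # otherwise per-comparison cost dominates and reordering is risky.
    if max(widths) > max_spread * min(widths):
        return False
    if max(procosts) > max_spread * min(procosts):
        return False
    return True
```

If any check fails, we simply keep the user-specified order.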

> But what I think we could do is using largest possible group instead of
> the average one. So for example when estimating number of comparisons
> for columns (c1,..,cN), you could look at MCV lists for these columns
> and compute
>
>     m(i) = Max(largest group in MCV list for i-th column)
>
> and then use Min(m(1), ..., m(k)) when estimating the number of
> comparisons.

I see the idea, but I'm a bit confused about how to get the largest group from
an MCV list. I mean, there is a list of most common values per column with
their frequencies, and similar data for multi-column statistics, but how do we
get the sizes of those groups?
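The only interpretation I can come up with (an assumption on my side) is that
a "group size" in an MCV list is just its frequency scaled by the estimated
number of rows; then the Min(m(1), ..., m(k)) suggestion would be something
like:

```python
# Hypothetical sketch of the Min(m(1), ..., m(k)) idea, assuming a group
# size in an MCV list is simply frequency * estimated row count.

def largest_group_estimate(ntuples, mcv_freqs_per_column):
    """mcv_freqs_per_column: one list of MCV frequencies per column."""
    # m(i) = size of the largest MCV group for the i-th column
    m = [max(freqs) * ntuples for freqs in mcv_freqs_per_column]
    # A group on (c1, ..., ck) can't be larger than the smallest of these.
    return min(m)

# e.g. 1M rows, c1's top value covers 30%, c2's top value covers 5%
print(largest_group_estimate(1_000_000, [[0.3, 0.1], [0.05, 0.02]]))
# 50000.0
```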


