Re: POC: GROUP BY optimization

From Tomas Vondra
Subject Re: POC: GROUP BY optimization
Date
Msg-id 20200622222530.qq22kdkukz7nnqmd@development
In response to Re: POC: GROUP BY optimization  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Mon, Jun 22, 2020 at 11:50:49AM -0400, Robert Haas wrote:
>On Sat, May 16, 2020 at 10:56 AM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>> The local effects are trivial - it's for example reordering the
>> pathkeys to make the explicit sort as cheap as possible. This thread
>> already discussed a number of things to base this on - ndistinct for
>> columns, cost of comparison function, ... In any case, this is
>> something we can decide locally, when building the grouping paths.
>>
>> The global effects are much harder to tackle, because the decision
>> can't be made locally when building the grouping paths. It requires
>> changes both below and above the point where we build grouping paths.
>>
>> An example of a decision we need to make before we even get to
>> building a grouping path is which index paths to build. Currently we
>> only build index paths with "useful" pathkeys, and without tweaking
>> that we'll never even see the index in add_paths_to_grouping_rel().
>>
>> But there are also decisions that can be made only after we build the
>> grouping paths. For example, we may have both GROUP BY and ORDER BY,
>> and there is no "always correct" way to combine those. In some cases
>> it may be correct to use the same pathkeys, in other cases it's
>> better to use different ones (which will require an extra Sort, with
>> additional cost).
>>
>> So I don't think there will be a single "interesting" grouping
>> pathkeys (i.e. root->group_pathkeys), but a collection of pathkeys.
>> And we'll need to build grouping paths for all of those, and leave
>> the planner to eventually pick the one giving us the cheapest plan
>> ...
>
> I agree with all of this and I think it's really good analysis. Part
> of the reason why the planner isn't that sophisticated in this area is
> that, for a long time, we didn't use paths at this level, and so it
> was much harder to write any kind of viable patch to consider
> alternatives.  With Tom's planner path-ification work there should be
> a lot more potential for optimization here, but, as you say, we need
> to do that by leveraging the existing costing machinery, not just via
> simple heuristics.

Agreed.
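
To make the interplay between grouping and ordering pathkeys a bit more
concrete, here's a trivial hypothetical example - the table, data and
names are made up, it's only meant to illustrate the kind of choice the
planner would have to make:

  -- made-up example table
  CREATE TABLE t (a int, b int, c int);
  INSERT INTO t SELECT i % 10, i % 1000, i
    FROM generate_series(1, 100000) s(i);
  ANALYZE t;

  -- The GROUP BY pathkeys (b, a) don't have to be taken literally.
  -- Sorting on (a, b) produces exactly the same groups, and the output
  -- is already ordered by a, so the ORDER BY needs no extra Sort on
  -- top of the aggregate. Whether that's actually cheaper is exactly
  -- the costing question (ndistinct, comparison costs, ...).
  SELECT a, b, count(*)
    FROM t
   GROUP BY b, a
   ORDER BY a;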

> It also strikes me that one of the problems in this area is that the
> statistics we currently gather don't seem to be entirely useful or
> reliable for aggregate planning.

I don't think that's an immediate issue. I don't think anyone started
implementing these optimizations and then decided not to because of a
lack of statistics.

The reason we don't have those optimizations is rather that we never
even considered them, possibly because of planner limitations. Of
course, the optimizations may require additional stats, but that's only
step #2.

> I wonder if there are extensions to the extended statistics mechanism,
> or even just more things we should gather during a routine ANALYZE,
> that would enable us to estimate things better here. The most obvious
> thing is that n_distinct is often wildly inaccurate, but it's deeper
> than that. For instance, we need some way to estimate how many groups
> you're going to get when you filter on a and then group by b that
> doesn't assume uniform distribution. And we also need something
> that doesn't just output a number of groups, but gives you some inkling
> of the sizes of those groups: 1 giant group and a bunch of little ones
> isn't the same as a bunch of equally sized groups. I don't know that
> these are things we really have much chance of figuring out with the
> currently-available information, though.
>

Not sure. I think the extended stats we have now (ndistinct coeffs,
multi-column MCV lists) should be a good basis for these decisions, even
with skewed data. Of course, we may need to add something else, but I'm
not sure what that would be.
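
Just to illustrate what I mean (reusing the made-up table from the
earlier example, with an arbitrary statistics name):

  -- ndistinct coefficients improve the estimate of the number of
  -- groups for GROUP BY a, b; the multi-column MCV list improves
  -- selectivity estimates for correlated conditions on (a, b)
  CREATE STATISTICS t_a_b_stats (ndistinct, mcv) ON a, b FROM t;
  ANALYZE t;

That's the kind of information I'd expect to be a sufficient starting
point for the reordering decisions.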

The main limitation of extended stats is that they're per-table. Once
you do a join, it's all over - we don't have that capability :-(

> Sorry if this is hijacking the thread a bit; I don't mean to
> discourage work on this specific patch. I'm just wondering if we need
> to think a little bigger to see our way to a good solution.
>

I think it's fine to continue with this optimization. Either we'll be
able to get it done with existing stats, or we'll find out what other
stats we need for sufficiently good heuristics.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


