Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From: James Coleman
Subject: Re: [PATCH] Incremental sort (was: PoC: Partial sort)
Msg-id: CAAaqYe_a7Umxn6J+GGXGu00f+Jwmcq83gh3dj6tC4RR75+yjAw@mail.gmail.com
In reply to: Re: [PATCH] Incremental sort (was: PoC: Partial sort)  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses: Re: [PATCH] Incremental sort (was: PoC: Partial sort)  (Rafia Sabih <rafia.pghackers@gmail.com>)
List: pgsql-hackers
On Sun, Jun 2, 2019 at 5:18 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> Thanks for the rebase! I think the patch is in pretty good shape - I'm
> sure we'll find ways to make it more efficient etc. but IMO that's fine
> and should not prevent getting it committed.

Thank you for the in-depth review!

> Currently, the costing logic (cost_incremental_sort) assumes all groups
> have the same size, which is fine for uniform distributions. But for
> skewed distributions, that may be an issue.
>
> Consider for example a table with 1M rows, two columns, 100 groups in each
> column, and index on the first one.
>
>     CREATE TABLE t (a INT, b INT);
>
>     INSERT INTO t SELECT 100*random(), 100*random()
>       FROM generate_series(1,1000000);
>
> Now, let's do a simple limit query to find the first row:
>
>     SELECT * FROM t ORDER BY a, b LIMIT 1;
>
> In this case the current costing logic is fine - the groups are close to
> average size, and we only need to sort the first group, i.e. 1% of data.
>
> Now, let's say the first group is much larger:
>
>
>     INSERT INTO t SELECT 0, 100*random()
>       FROM generate_series(1,900000) s(i);
>
>     INSERT INTO t SELECT 100*random(), 100*random()
>       FROM generate_series(1,100000);
>
> That is, the first group now accounts for roughly 90% of the data, while
> the number of groups stays the same. So to produce the first row we must
> sort 90% of the data, yet the average group size is unchanged, and the
> current cost model is oblivious to this.

Thinking out loud here: the current implementation doesn't guarantee
that sort groups always have the same prefix column values, because
(per the code comments) "sorting many small groups with tuplesort is
inefficient". While that seems a reasonable optimization, I think it
may have steered us away from an optimization in the opposite
direction. Perhaps we should always track whether or not all prefix
tuples are the same (currently we only do that after reaching
DEFAULT_MIN_GROUP_SIZE tuples) and use that information to have
tuplesort compare only the non-prefix columns (currently it has to
sort on all pathkey columns, even though for a large group the prefix
columns are guaranteed to be equal).

Essentially I'm trying to think of ways that would get us to
comparable performance with regular sort in the case of large batch
sizes.
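To make the suffix-only idea concrete, here's a rough sketch in Python
(purely illustrative: the names and structure are my own invention, not
the patch's C code, and it ignores the small-group batching the patch
actually does):

```python
def incremental_sort(tuples, n_prefix):
    """Sort tuples that already arrive ordered by their first n_prefix
    columns. Because every tuple within a prefix group shares the same
    prefix values, the comparator only needs the non-prefix (suffix)
    columns -- which is the optimization being discussed."""
    result = []
    group = []
    prev_prefix = None
    for t in tuples:
        prefix = t[:n_prefix]
        if prefix != prev_prefix and group:
            # All tuples in `group` share a prefix, so sorting on the
            # suffix alone is sufficient (and cheaper than a full-key sort).
            group.sort(key=lambda g: g[n_prefix:])
            result.extend(group)
            group = []
        group.append(t)
        prev_prefix = prefix
    group.sort(key=lambda g: g[n_prefix:])
    result.extend(group)
    return result
```

The point is that for a huge group (the 90% case above), each comparison
touches only the suffix columns, instead of re-comparing prefix columns
that are guaranteed equal.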

One other thing about the DEFAULT_MIN_GROUP_SIZE logic: in the case
where you have a very small group followed by a very large batch, we'd
lose the ability to apply the above optimization. That makes me wonder
whether we shouldn't intentionally optimize for the possibility of
large batch sizes, since a little extra expense per group/tuple is
more likely to be a non-concern with small groups anyway, when there
are large numbers of input tuples but a relatively small limit.

Thoughts?

> So I think we should look at the MCV list, and use that information when
> computing the startup/total cost. I think using the first/largest group to
> compute the startup cost, and the average group size for total cost would
> do the trick.

I think this sounds very reasonable.
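To sketch what that might look like (a toy cost model, not planner code;
the names and the m*log(m) shape are my own invention), using the MCV
frequencies as exposed in pg_stats.most_common_freqs:

```python
import math

def incremental_sort_costs(n_tuples, mcv_freqs, n_groups, per_tuple_cost=1.0):
    """Toy model of the proposal: derive the startup cost from the largest
    group implied by the MCV list, and the total cost from the average
    group size as before."""
    def sort_cost(m):
        # simple comparison-sort cost shape: m * log2(m)
        return per_tuple_cost * m * math.log2(max(m, 2))

    avg_group = n_tuples / n_groups
    # largest group is at least the average, possibly much more per the MCVs
    largest_group = max(max(mcv_freqs) * n_tuples, avg_group)
    startup_cost = sort_cost(largest_group)       # first/largest group must be sorted
    total_cost = n_groups * sort_cost(avg_group)  # uniform estimate for the rest
    return startup_cost, total_cost
```

With the skewed example above (one group holding 90% of 1M rows), the
startup cost balloons appropriately even though the average group size,
and hence the total-cost estimate, is unchanged.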

> I don't think we can do much better than this during planning. There will
> inevitably be cases where the costing model will push us to do the wrong
> thing, in either direction. The question is how serious an issue that is, and
> whether we could remedy that during execution somehow.
>
> When we "incorrectly" pick full sort (when the incremental sort would be
> faster), that's obviously sad but I think it's mostly fine because it's
> not a regression.
>
> The opposite direction (picking incremental sort, while full sort being
> faster) is probably more serious, because it's a regression between
> releases.
>
> I don't think we can fully fix that by refining the cost model. We have
> two basic options:
>
> 1) Argue/show that this is not an issue in practice, because (a) such
> cases are very rare, and/or (b) the regression is limited. In short, the
> benefits of the patch outweigh the losses.

My comments above go in this direction. If we can improve performance
in the worst case, I think it's plausible this concern becomes a
non-issue.

> 2) Provide some fallback at execution time. For example, we might watch
> the size of the group, and if we run into an unexpectedly large one we
> might abandon the incremental sort and switch to a "full sort" mode.

Are there good examples of our doing this in other types of nodes
(where the fallback is an entirely different algorithm or node type)?
I like this idea in theory, but I also suspect it would add a very
significant amount of complexity. The other problem is knowing where
to draw the line: you end up creating cliffs where pulling one more
tuple through the incremental sort would have completed the batch and
avoided pulling many more tuples through a regular sort node, but the
fallback logic kicks in anyway.
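For what it's worth, here's roughly how I picture the fallback (and the
cliff), as an illustrative Python sketch rather than anything resembling
the real executor code:

```python
def sort_with_fallback(tuples, n_prefix, group_limit):
    """Sketch of the execution-time fallback: emit prefix groups
    incrementally, but if the current group grows past group_limit,
    abandon the incremental strategy and fully sort everything not yet
    emitted. Correct because remaining input is still prefix-ordered."""
    it = iter(tuples)
    emitted = []
    group = []
    prev_prefix = None
    for t in it:
        prefix = t[:n_prefix]
        if prefix != prev_prefix and group:
            group.sort(key=lambda g: g[n_prefix:])
            emitted.extend(group)
            group = []
        group.append(t)
        prev_prefix = prefix
        if len(group) > group_limit:
            # The "cliff": one tuple past the limit forces a full sort of
            # the buffered group plus all remaining input, even if the
            # group would have ended on the very next tuple.
            rest = group + list(it)
            rest.sort()
            emitted.extend(rest)
            return emitted
    group.sort(key=lambda g: g[n_prefix:])
    emitted.extend(group)
    return emitted
```

Even in this toy version you can see the discontinuity: the cost of
crossing group_limit by one tuple is the full sort of all remaining
input.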

Unrelated to all of the above: if I read the patch correctly, it
intentionally excludes backwards scanning. I don't see any particular
reason why that ought to be the case, and it seems like an odd
limitation for the feature should it be merged. Should that be a
blocker to merging?

James Coleman


