Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From Rafia Sabih
Subject Re: [PATCH] Incremental sort (was: PoC: Partial sort)
Date
Msg-id CA+FpmFffDyU_FTf8c3a9+GWqMmsQ9xy6Dibc0Ja9v4BeYzjXow@mail.gmail.com
In response to Re: [PATCH] Incremental sort (was: PoC: Partial sort)  (James Coleman <jtc331@gmail.com>)
Responses Re: [PATCH] Incremental sort (was: PoC: Partial sort)  (James Coleman <jtc331@gmail.com>)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)  (Simon Riggs <simon@2ndquadrant.com>)
List pgsql-hackers
On Mon, 3 Jun 2019 at 15:39, James Coleman <jtc331@gmail.com> wrote:
>
> On Sun, Jun 2, 2019 at 5:18 PM Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
> > Currently, the costing logic (cost_incremental_sort) assumes all groups
> > have the same size, which is fine for uniform distributions. But for
> > skewed distributions, that may be an issue.
> >
> > Consider for example a table with 1M rows, two columns, 100 groups in each
> > column, and index on the first one.
> >
> >     CREATE table t (a INT, b INT);
> >
> >     INSERT INTO t SELECT 100*random(), 100*random()
> >       FROM generate_series(1,1000000);
> >
> > Now, let's do a simple limit query to find the first row:
> >
> >     SELECT * FROM t ORDER BY a, b LIMIT 1;
> >
> > In this case the current costing logic is fine - the groups are close to
> > average size, and we only need to sort the first group, i.e. 1% of data.
> >
> > Now, let's say the first group is much larger:
> >
> >
> >     INSERT INTO t SELECT 0, 100*random()
> >       FROM generate_series(1,900000) s(i);
> >
> >     INSERT INTO t SELECT 100*random(), 100*random()
> >       FROM generate_series(1,100000);
> >
> > That is, the first group is now roughly 90% of the data, while the number
> > of groups stays the same. Producing the first row requires sorting 90% of
> > the data, but the average group size is unchanged, so the current cost
> > model is oblivious to this.
>
> Thinking out loud here: the current implementation doesn't guarantee
> that sort groups always have the same prefix column values because
> (from code comments) "Sorting many small groups with tuplesort is
> inefficient". While this seems a reasonable optimization, I think it's
> possible that thinking steered away from an optimization in the
> opposite direction. Perhaps we should always track whether or not all
> prefix tuples are the same (currently we only do that after reaching
> DEFAULT_MIN_GROUP_SIZE tuples) and use that information to be able to
> have tuplesort only care about the non-prefix columns (where currently
> it has to sort on all pathkey columns even though for a large group
> the prefix columns are guaranteed to be equal).
>
+1 for passing only the non-prefix columns to tuplesort.
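To make that concrete, here is a minimal Python sketch (illustrative only, not the patch's C code) of why sorting on only the non-prefix columns is safe once the prefix is known to be constant within a group:

```python
def sort_group_suffix_only(group, prefix_len):
    """Within a group whose first prefix_len columns are all equal,
    sorting by only the remaining (suffix) columns yields the same
    order as sorting by the full key, but with cheaper comparisons.
    Purely illustrative; the patch would do this inside tuplesort."""
    # Precondition: every row in the group shares the same prefix.
    assert all(r[:prefix_len] == group[0][:prefix_len] for r in group)
    return sorted(group, key=lambda r: r[prefix_len:])

group = [(5, 3), (5, 1), (5, 2)]
assert sort_group_suffix_only(group, 1) == sorted(group)
```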
> Essentially I'm trying to think of ways that would get us to
> comparable performance with regular sort in the case of large batch
> sizes.
>
> One other thing about the DEFAULT_MIN_GROUP_SIZE logic is that in the
> case where you have a very small group and then a very large batch,
> we'd lose the ability to optimize in the above way. That makes me
> wonder if we shouldn't intentionally optimize for the possibility of
> large batch sizes, since a little extra expense per group/tuple is
> more likely to be a non-concern with small groups anyway when there
> are large numbers of input tuples but a relatively small limit.
>
What about using knowledge of the MCVs here: if we know the next value is in
the MCV list, accept the overhead of sorting that small group alone, and then
leverage the optimization for the larger group by passing only the non-prefix
columns.
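As a toy illustration of how MCV statistics could feed the cost model (all
names and the comparison-cost constant are made up; this is not the actual
cost_incremental_sort logic): startup cost could be driven by the largest MCV
frequency, while total cost keeps using the average group size.

```python
import math

def incremental_sort_costs(ntuples, ndistinct, mcv_freqs, cmp_cost=1.0):
    """Toy cost model: startup cost is dominated by sorting the first
    group, so for a skewed distribution assume the worst case that the
    first group is the most common value.  Total cost still uses the
    average group size.  (Names and constants are illustrative only.)"""
    avg_group = ntuples / ndistinct
    # Largest group: most common value's frequency, if MCV stats exist.
    first_group = max(mcv_freqs) * ntuples if mcv_freqs else avg_group
    sort_cost = lambda n: cmp_cost * n * math.log2(max(n, 2))
    startup = sort_cost(first_group)
    total = startup + (ndistinct - 1) * sort_cost(avg_group)
    return startup, total

# Uniform: 1M rows, 100 groups of ~10k rows each.
s_uniform, _ = incremental_sort_costs(1_000_000, 100, [0.01] * 100)
# Skewed: the first group holds 90% of the rows.
s_skewed, _ = incremental_sort_costs(1_000_000, 100, [0.9] + [0.001] * 99)
assert s_skewed > s_uniform  # skew drives the estimated startup cost up
```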
> Thoughts?
>
> > So I think we should look at the MCV list, and use that information when
> > computing the startup/total cost. I think using the first/largest group to
> > compute the startup cost, and the average group size for total cost would
> > do the trick.
>
+1
> I think this sounds very reasonable.
>
> > I don't think we can do much better than this during planning. There will
> > inevitably be cases where the costing model will push us to do the wrong
> > thing, in either direction. The question is how serious issue that is, and
> > whether we could remedy that during execution somehow.
> >
> > When we "incorrectly" pick full sort (when the incremental sort would be
> > faster), that's obviously sad but I think it's mostly fine because it's
> > not a regression.
> >
> > The opposite direction (picking incremental sort, while full sort being
> > faster) is probably more serious, because it's a regression between
> > releases.
> >
> > I don't think we can fully fix that by refining the cost model. We have
> > two basic options:
> >
> > 1) Argue/show that this is not an issue in practice, because (a) such
> > cases are very rare, and/or (b) the regression is limited. In short, the
> > benefits of the patch outweigh the losses.
>
> My comments above go in this direction. If we can improve performance
> in the worst case, I think it's plausible this concern becomes a
> non-issue.
>
> > 2) Provide some fallback at execution time. For example, we might watch
> > the size of the group, and if we run into an unexpectedly large one we
> > might abandon the incremental sort and switch to a "full sort" mode.
>
> Are there good examples of our doing this in other types of nodes
> (whether the fallback is an entirely different algorithm/node type)? I
> like this idea in theory, but I also think it's likely it would add a
> very significant amount of complexity. The other problem is knowing
> where to draw the line: you end up creating these kinds of cliffs
> where pulling one more tuple through the incremental sort would give
> you your batch and result in not having to pull many more tuples in a
> regular sort node, but the fallback logic kicks in anyway.
>
What about a simple mechanism for this: if we encounter a group with more
tuples than estimated, switch to a normal sort for the remaining tuples, since
the estimates do not hold anyway. At least that would avoid regressions where
incremental sort is much worse than a normal sort.
I mean something like this: populate the tuplesortstate and keep counting the
number of tuples in the group; if they stay within budget, call
tuplesort_performsort, otherwise put all the remaining tuples into the
tuplesort and then call tuplesort_performsort. We could have an additional
field in IncrementalSortState to save the estimated size of each group. I am
assuming we use MCV lists to better approximate the group sizes, as suggested
above by Tomas.
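A minimal sketch of that fallback idea in Python (a generator stands in for
the executor node, and all names are invented; the real node would operate on
tuplesort states rather than Python lists):

```python
def incremental_sort(rows, prefix_len, est_group_size):
    """Sort rows that already arrive ordered by the first prefix_len
    columns.  While groups stay within the estimated size, sort each
    group alone (cheap startup).  Once a group exceeds the estimate,
    fall back to buffering and sorting everything that remains, as a
    plain Sort node would.  (Illustrative sketch only.)"""
    rows = iter(rows)
    group, fallback = [], False
    for row in rows:
        if group and row[:prefix_len] != group[0][:prefix_len]:
            # Group boundary reached: emit the finished group sorted.
            yield from sorted(group)
            group = [row]
        else:
            group.append(row)
            if len(group) > est_group_size:
                fallback = True
                break
    if fallback:
        # Estimates were wrong: sort the oversized group together with
        # all remaining input in one pass.
        yield from sorted(group + list(rows))
    else:
        yield from sorted(group)

rows = [(0, 3), (0, 1), (1, 2), (1, 0)]
assert list(incremental_sort(rows, 1, 10)) == sorted(rows)  # no fallback
assert list(incremental_sort(rows, 1, 1)) == sorted(rows)   # fallback path
```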

> Unrelated to all of the above: if I read the patch properly it
> intentionally excludes backwards scanning. I don't see any particular
> reason why that ought to be the case, and it seems like an odd
> limitation for the feature should it be merged. Should that be a
> blocker to merging?

Regarding this, I came across the following comment:
/*
  * Incremental sort can't be used with either EXEC_FLAG_REWIND,
  * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we hold only current
  * bucket in tuplesortstate.
  */
I think that is quite understandable. How are you planning to support
backward scans for this? In other words, when would incremental sort be
useful for a backward scan?

On a different note, I can't stop imagining this operator along the lines of
Parallel Append, wherein multiple workers sort the different groups
independently at the same time.
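Purely as a thought experiment, that might look like the following sketch
(Python threads standing in for parallel workers; everything here is invented
for illustration, not a proposal for the executor):

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import groupby

def parallel_group_sort(rows, prefix_len, max_workers=4):
    """Hypothetical parallel variant: split the prefix-ordered input
    into groups, sort the groups concurrently, and concatenate the
    results in group order.  (Illustrative sketch only.)"""
    groups = [list(g) for _, g in
              groupby(rows, key=lambda r: r[:prefix_len])]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map preserves input order, so the sorted groups
        # come back in the same sequence they were submitted.
        return [row for g in pool.map(sorted, groups) for row in g]

rows = [(1, 'b'), (1, 'a'), (2, 'z'), (2, 'c'), (2, 'a')]
assert parallel_group_sort(rows, 1) == sorted(rows)
```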

-- 
Regards,
Rafia Sabih


