Re: [PATCH] Incremental sort (was: PoC: Partial sort)

Поиск
Список
Период
Сортировка
От James Coleman
Тема Re: [PATCH] Incremental sort (was: PoC: Partial sort)
Дата
Msg-id CAAaqYe-4CPXDAn9c4NyAurGCrpOweR8BGQgo7RdvXaHLgzOMWA@mail.gmail.com
обсуждение исходный текст
Ответ на Re: [PATCH] Incremental sort (was: PoC: Partial sort)  (Rafia Sabih <rafia.pghackers@gmail.com>)
Ответы Re: [PATCH] Incremental sort (was: PoC: Partial sort)  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Список pgsql-hackers
On Wed, Jun 5, 2019 at 12:14 PM Rafia Sabih <rafia.pghackers@gmail.com> wrote:
> > > 2) Provide some fallback at execution time. For example, we might watch
> > > the size of the group, and if we run into an unexpectedly large one we
> > > might abandon the incremental sort and switch to a "full sort" mode.
> >
> > Are there good examples of our doing this in other types of nodes
> > (whether the fallback is an entirely different algorithm/node type)? I
> > like this idea in theory, but I also think it's likely it would add a
> > very significant amount of complexity. The other problem is knowing
> > where to draw the line: you end up creating these kinds of cliffs
> > where pulling one more tuple through the incremental sort would give
> > you your batch and result in not having to pull many more tuples in a
> > regular sort node, but the fallback logic kicks in anyway.
> >
> What about having some simple mechanism for this, like if we encounter
> the group with more tuples than the one estimated then simply switch
> to normal sort for the remaining tuples, as the estimates does not
> hold true anyway. Atleast this will not give issues of having
> regressions of incremental sort being too bad than the normal sort.
> I mean having something like this, populate the tuplesortstate and
> keep on counting the number of tuples in a group, if they are within
> the budget call tuplesort_performsort, otherwise put all the tuples in
> the tuplesort and then call tuplesort_performsort. We may have an
> additional field in IncrementalSortState to save the estimated size of
> each group. I am assuming that we use MCV lists to approximate better
> the group sizes as suggested above by Tomas.

I think the first thing to do is get some concrete numbers on performance if we:

1. Only sort one group at a time.
2. Update the costing to prefer traditional sort unless we have very
high confidence we'll win with incremental sort.

It'd be nice not to have to add additional complexity if at all possible.

> > Unrelated to all of the above: if I read the patch properly it
> > intentionally excludes backwards scanning. I don't see any particular
> > reason why that ought to be the case, and it seems like an odd
> > limitation for the feature should it be merged. Should that be a
> > blocker to merging?
>
> Regarding this, I came across this,
> /*
>   * Incremental sort can't be used with either EXEC_FLAG_REWIND,
>   * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we hold only current
>   * bucket in tuplesortstate.
>   */
> I think that is quite understandable. How are you planning to support
> backwards scan for this? In other words, when will incremental sort be
> useful for backward scan.

For some reason I was thinking we'd need it to support backwards scans
to be able to handle DESC sort on the index, but I've tested and
confirmed that already works. I suppose that's because the index scan
provides that ordering and the sort node doesn't need to reverse the
direction of what's provided to it. That's not particularly obvious to
someone newer to the codebase; I'm not sure if that's documented
anywhere.

> On a different note, I can't stop imagining this operator on the lines
> similar to parallel-append, wherein multiple workers can sort the
> different groups independently at the same time.

That is an interesting idea. I suppose it'd be particularly valuable
if somehow there a node that was generating each batch in parallel
already, though I'm not sure at first though what kind of query or
node would result in that. I also wonder if (assuming that weren't the
case) it would be much of an improvement since a single thread would
have to generate each batch anyway; I'm not sure if the overhead of
farming each batch out to a worker would actually be useful or if the
real blocker is the base scan.

At the very least it's an interesting idea.

---

I've been writing down notes here, and I realized that my test case
from far upthread is actually a useful setup to see how much overhead
is involved in sorting each batch individually, since it sets up data
with each batch only containing 1 tuple. That particular case is one
we could easily optimize anyway in the code and skip sorting
altogether -- might be a useful enhancement.

I hope to do some more testing and then report back with the results.

James Coleman



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Thomas Munro
Дата:
Сообщение: Re: POC: Cleaning up orphaned files using undo logs
Следующее
От: James Coleman
Дата:
Сообщение: Re: Index Skip Scan