Re: Using quicksort for every external sort run

From: Robert Haas
Subject: Re: Using quicksort for every external sort run
Date:
Msg-id: CA+TgmoYtkWC2KqQ8wZvu32bcd5MS2mpWdV3HsF7O0v0w_c37CQ@mail.gmail.com
In reply to: Re: Using quicksort for every external sort run  (Peter Geoghegan <pg@heroku.com>)
Responses: Re: Using quicksort for every external sort run  (Peter Geoghegan <pg@heroku.com>)
List: pgsql-hackers

On Tue, Dec 22, 2015 at 4:37 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> This looks like voodoo to me.  I assume you tested it and maybe it
>> gives correct answers, but it's got to be some kind of world record
>> for number of arbitrary constants per SLOC, and there's no real
>> justification for any of it.  The comments say, essentially, well, we
>> do this because it works.  But suppose I try it on some new piece of
>> hardware and it doesn't work well.  What do I do?  Email the author
>> and ask him to tweak the arbitrary constants?
>
> That's not fair. DEFAULT_EQ_SEL, DEFAULT_RANGE_INEQ_SEL, and
> DEFAULT_NUM_DISTINCT are each about as arbitrary. We have to do
> something, though.
>
> MaxAllocHugeSize is used fairly arbitrarily in pg_stat_statements.c.
> And that part (the MaxAllocSize part of my patch) only defines a point
> after which we require a really favorable case for replacement
> selection/quicksort with spillover to proceed. It's a safety valve. We
> try to err on the side of not using replacement selection.

Sure, there are arbitrary numbers all over the code, driven by
empirical observations about what factors are important to model.  But
this is not that.  You don't have a thing called seq_page_cost and a
thing called cpu_tuple_cost and then say, well, empirically the ratio
is about 100:1, so let's make the former 1 and the latter 0.01.  You
just have some numbers, and it's not clear what, if anything, they
actually represent.  In the space of 7 lines of code, you introduce 9
nameless constants:

The crossover point is clamped to a minimum of 40% [constant #1] and a
maximum of 85% [constant #2] when the size of the SortTuple array is
no more than MaxAllocSize.  Between those bounds, the crossover point
is 90% [constant #3] minus 7.5% [constant #4] per 32-byte increment
[constant #5] of estimated average tuple size.  On the other hand,
when the size of the SortTuple array exceeds MaxAllocSize, the
crossover point is either 90% [constant #6] or 95% [constant #7]
depending on whether the average tuple size is greater than 32 bytes
[constant #8].  But if the row count hint is less than 50% [constant
#9] of the rows we've already seen, then we ignore it and do not use
selection.
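
To make the shape of this concrete, here is a minimal C sketch of the
rule as I've described it above (crossover_point, MAX_ALLOC_SIZE, and
the 24-byte SortTuple size are stand-ins of mine, not the patch's
actual code):

    #include <stddef.h>
    #include <stdint.h>

    #define MAX_ALLOC_SIZE  ((size_t) 0x3fffffff) /* MaxAllocSize, just under 1GB */
    #define SORT_TUPLE_SIZE ((size_t) 24)         /* assumed sizeof(SortTuple) */

    /* Crossover point as a fraction of the estimated row count. */
    static double
    crossover_point(int64_t row_estimate, size_t avg_tuple_size)
    {
        if ((size_t) row_estimate * SORT_TUPLE_SIZE <= MAX_ALLOC_SIZE)
        {
            /* 90% minus 7.5% per 32-byte increment, clamped to [40%, 85%] */
            double cp = 0.90 - 0.075 * (double) (avg_tuple_size / 32);

            if (cp < 0.40)
                cp = 0.40;
            if (cp > 0.85)
                cp = 0.85;
            return cp;
        }

        /* SortTuple array would exceed MaxAllocSize: 90% or 95% */
        return (avg_tuple_size > 32) ? 0.95 : 0.90;
    }

Written out like that, the discontinuity below falls straight out of
the branch: 44,739,242 rows * 24 bytes stays within MAX_ALLOC_SIZE,
and a 224-byte average gives 0.90 - 7 * 0.075 = 0.375, clamped to
0.40; one more row takes the other branch and returns 0.95.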

You make no attempt to justify why any of these numbers are correct,
or what underlying physical reality they represent.  The comment which
describes the manner in which the crossover point is computed for
SortTuple volumes under 1GB says "Starting from a threshold of 90%,
refund 7.5% per 32 byte average-size-increment."  That is a precise
restatement of what the code does, but it doesn't attempt to explain
why it's a good idea.  Perhaps the reader should infer that the
crossover point drops as the tuples get bigger, except that in the
over-1GB case, a larger tuple size causes the crossover point to go
*up* while in the under-1GB case, a larger tuple size causes the
crossover point to go *down*.  Concretely, if we're sorting 44,739,242
224-byte tuples, the estimated crossover point is 40%.  If we're
sorting 44,739,243 224-byte tuples, the estimated crossover point is
95%.  That's an extremely sharp discontinuity, and it seems very
unlikely that any real system behaves that way.

I'm prepared to concede that constant #9 - ignoring the input row
estimate if we've already seen twice that many rows - probably doesn't
need a whole lot of justification here, and what justification it does
need is provided by the fact that (we think) replacement selection
only wins when there are going to be fewer than two quicksorted runs.
But the other 8 constants here have to have reasons why they exist,
what they represent, and why they have the values they do, and that
explanation needs to be something that can be understood by people
besides you.  The overall cost model needs some explanation of the
theory of operation, too.

In my opinion, reasoning in terms of a crossover point is a strange
way of approaching the problem.  What would be more typical at least
in our code, and I suspect in general, is to do a cost estimate of using
selection and a cost estimate of not using selection and compare them.
Replacement selection has a CPU cost and an I/O cost, each of which is
estimable based on the tuple count, chosen comparator, and expected
I/O volume.  Quicksort has those same costs, in different amounts.  If
those respective costs are accurately estimated, then you can pick the
strategy with the lower cost and expect to win.
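
To be concrete about the shape I have in mind (the coefficients and
formulas below are placeholders for illustration, not a worked-out
model):

    #include <math.h>
    #include <stdbool.h>

    typedef struct
    {
        double cpu;   /* comparison work, arbitrary units */
        double io;    /* bytes written to tape and read back */
    } SortCost;

    /* Quicksort every run: sort each run, spill everything. */
    static SortCost
    cost_quicksort_runs(double ntuples, double tuples_per_run,
                        double tuple_bytes)
    {
        SortCost c;

        c.cpu = ntuples * log2(tuples_per_run);
        c.io = 2.0 * ntuples * tuple_bytes;
        return c;
    }

    /* Replacement selection: a heap sift per tuple, but possibly only
     * a fraction of the input ever needs to be spilled. */
    static SortCost
    cost_replacement_selection(double ntuples, double heap_tuples,
                               double tuple_bytes, double spill_fraction)
    {
        SortCost c;

        c.cpu = ntuples * log2(heap_tuples);
        c.io = 2.0 * spill_fraction * ntuples * tuple_bytes;
        return c;
    }

    static bool
    prefer_replacement_selection(SortCost rs, SortCost qs,
                                 double cpu_weight, double io_weight)
    {
        return cpu_weight * rs.cpu + io_weight * rs.io <
               cpu_weight * qs.cpu + io_weight * qs.io;
    }

The point is only the structure: two estimates in comparable units and
a comparison between them, rather than a crossover fraction with the
constants baked in.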

>> I wonder if there's a way to accomplish what you're trying to do here
>> that avoids the need to have a cost model at all.  As I understand it,
>> and please correct me wherever I go off the rails, the situation is:
>>
>> 1. If we're sorting a large amount of data, such that we can't fit it
>> all in memory, we will need to produce a number of sorted runs and
>> then merge those runs.  If we generate each run using a heap with
>> replacement selection, rather than quicksort, we will produce runs
>> that are, on the average, about twice as long, which means that we
>> will have fewer runs to merge at the end.
>>
>> 2. Replacement selection is slower than quicksort on a per-tuple
>> basis.  Furthermore, merging more runs isn't necessarily any slower
>> than merging fewer runs.  Therefore, building runs via replacement
>> selection tends to lose even though it tends to reduce the number of
>> runs to merge.  Even when having a larger number of runs results in an
>> increase in the number of merge passes, we save so much time building the
>> runs that we often (maybe not always) still come out ahead.
>
> I'm with you so far. I'll only add: doing multiple passes ought to be
> very rare anyway.
>
>> 3. However, when replacement selection would result in a single run,
>> and quicksort results in multiple runs, using quicksort loses.  This
>> is especially true when the amount of data we have is between one
>> and two times work_mem.  If we fit everything into one run, we do not
>> need to write any data to tape, but if we overflow by even a single
>> tuple, we have to write a lot of data to tape.
>
> No, this is where you lose me. I think that it's basically not true
> that replacement selection can ever be faster than quicksort, even in
> the cases where the conventional wisdom would have you believe so
> (e.g. what you say here). Unless you have very little memory relative
> to data size, or something along those lines. The conventional wisdom
> obviously needs some revision, but it was perfectly correct in the
> 1970s and 1980s.
>
> However, where replacement selection can still help is avoiding I/O
> *entirely*. If we can avoid spilling 95% of tuples in the first place,
> and quicksort the remaining (heapified) tuples that were not spilled,
> and merge an in-memory run with an on-tape run, then we can win big.

That's pretty much what I was trying to say, except that I'm curious
to know whether replacement selection can win when it manages to
generate a vastly longer run than what we get from quicksorting.  Say
quicksorting produces 10, or 100, or 1000 tapes, and replacement
selection produces 1 due to a favorable data distribution.

> Quicksort is not amenable to incremental spilling at all. I call this
> "quicksort with spillover" (it is a secondary optimization that the
> patch adds). This shows up in EXPLAIN ANALYZE, and avoids a stark
> discontinuity in the cost function of sorts. That could really help
> with admission control, and simplifying the optimizer, making merge
> joins less scary. So with the patch, "quicksort with spillover" and
> "replacement selection" are almost synonymous, except that we
> acknowledge the historic importance of replacement selection to some
> degree. The patch completely discards the conventional use of
> replacement selection -- it just preserves its priority queue (heap)
> implementation where incrementalism is thought to be particularly
> useful (avoiding I/O entirely).
>
> But this comparison has nothing to do with comparing the master branch
> with my patch, since the master branch never attempts to avoid I/O
> having committed to an external sort. It uses replacement selection in
> a way that is consistent with the conventional wisdom, wisdom which
> has now been shown to be obsolete.
>
> BTW, I think that abandoning incrementalism (replacement selection)
> will have future benefits for memory management. I bet we can get away
> with one big palloc() for second or subsequent runs that are
> quicksorted, greatly reducing palloc() overhead and waste there, too.
>
>> If this is correct so far, then I wonder if we could do this: Forget
>> replacement selection.  Always build runs by quicksorting.  However,
>> when dumping the first run to tape, dump it a little at a time rather
>> than all at once.  If the input ends before we've completely written
>> the run, then we've got all of run 1 in memory and run 0 split between
>> memory and tape.  So we don't need to do any extra I/O; we can do a
>> merge between run 1 and the portion of run 0 which is on tape.  When
>> the tape is exhausted, we only need to finish merging the in-memory
>> tails of the two runs.
>
> My first attempt at this -- before I realized that replacement
> selection was just not a very good algorithm, due to the upsides not
> remotely offsetting the downsides on modern hardware -- was a hybrid
> between quicksort and replacement selection.
>
> The problem is that there is too much repeated work. If you spill like
> this, you have to quicksort everything again. The replacement
> selection queue keeps track of a currentRun and nextRun, to avoid
> this, but quicksort can't really do that well.

I agree, but that's not what I proposed.  You don't want to keep
re-sorting to incorporate new tuples into the run, but if you've got
1010 tuples and you can fit 1000 tuples in, you can (a) quicksort the
first 1000 tuples, (b) read in 10 more tuples, dumping the first 10
tuples from run 0 to disk, (c) quicksort the last 10 tuples to create
run 1, and then (d) merge run 0 [which is mostly in memory] with run 1
[which is entirely in memory].  In other words, yes, quicksorting
doesn't let you add things to the sort incrementally, but you can
still write out the run incrementally, writing only as many tuples as
you need to dump to get the rest of the input data into memory.
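
A compilable toy version of that control flow, using ints for tuples
and an array standing in for the tape (it assumes the overflow fits in
memory, i.e. the input is less than twice the memory budget, which is
the interesting case here):

    #include <stdlib.h>
    #include <string.h>

    static int
    cmp_int(const void *a, const void *b)
    {
        int x = *(const int *) a;
        int y = *(const int *) b;

        return (x > y) - (x < y);
    }

    /* logical run 0: spilled prefix on "tape", sorted tail still in mem */
    static int
    run0_at(const int *tape, const int *mem, size_t overflow, size_t j)
    {
        return j < overflow ? tape[j] : mem[j];
    }

    /* in/n_in: whole input; mem/capacity: "work_mem"; out: n_in slots */
    static void
    sort_with_minimal_spill(const int *in, size_t n_in,
                            int *mem, size_t capacity, int *out)
    {
        size_t n0 = n_in < capacity ? n_in : capacity;
        size_t overflow = n_in - n0;    /* assumed <= capacity */
        int *tape = malloc(overflow * sizeof(int));
        size_t a = 0, b = 0, k = 0, i;

        memcpy(mem, in, n0 * sizeof(int));
        qsort(mem, n0, sizeof(int), cmp_int);        /* run 0 */

        for (i = 0; i < overflow; i++)
        {
            tape[i] = mem[i];       /* dump smallest remaining run 0 tuple */
            mem[i] = in[n0 + i];    /* reuse the slot for a run 1 tuple */
        }
        qsort(mem, overflow, sizeof(int), cmp_int);  /* run 1 */

        /* merge logical run 0 with run 1 (mem[0..overflow)) */
        while (a < n0 && b < overflow)
        {
            int va = run0_at(tape, mem, overflow, a);

            if (va <= mem[b])
            {
                out[k++] = va;
                a++;
            }
            else
                out[k++] = mem[b++];
        }
        while (a < n0)
            out[k++] = run0_at(tape, mem, overflow, a++);
        while (b < overflow)
            out[k++] = mem[b++];
        free(tape);
    }

Only the overflow tuples are ever written out and read back, however
the merge proceeds; everything else stays in memory.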

>> I also wonder if you've thought about the case where we are asked to
>> sort an enormous amount of data that is already in order, or very
>> nearly in order (2,1,4,3,6,5,8,7,...).  It seems worth including a
>> check to see whether the low value of run N+1 is higher than the high
>> value of run N, and if so, append it to the existing run rather than
>> starting a new one.  In some cases this could completely eliminate the
>> final merge pass at very low cost, which seems likely to be
>> worthwhile.
>
> While I initially shared this intuition -- that replacement selection
> could hardly be beaten by a simple hybrid sort-merge strategy for
> almost sorted input -- I changed my mind. I simply did not see any
> evidence for it. I may have missed something, but it really does not
> appear to be worth while. The quicksort fallback to insertion sort
> also does well with presorted input. The merge is very cheap (over and
> above reading one big run off disk) for presorted input under most
> circumstances. A cost model adds a lot of complexity, which I hesitate
> to add without clear benefits.

I don't think you need any kind of cost model to implement the
approach of appending to an existing run when the values in the new
run are strictly greater.
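
For what it's worth, the check itself is one comparison per run.
Something like this (hypothetical names, with an int key standing in
for the real comparator):

    #include <stdbool.h>

    typedef struct
    {
        int key;    /* stand-in for the real sort key */
    } Tuple;

    static int
    comparetup(const Tuple *a, const Tuple *b)
    {
        return (a->key > b->key) - (a->key < b->key);
    }

    /* True if the freshly quicksorted run can simply extend the
     * previous run: its first tuple must not sort before the last
     * tuple already written.  (>= also keeps the output sorted;
     * strict > is the conservative reading of the proposal.) */
    static bool
    can_append_to_prev_run(const Tuple *new_run_first,
                           const Tuple *prev_run_last)
    {
        return comparetup(new_run_first, prev_run_last) >= 0;
    }

If it returns true, append to run N instead of starting run N+1, and
in the favorable cases the final merge pass disappears entirely.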

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


