Re: Using quicksort and a merge step to significantly improve on tuplesort's single run "external sort"

From: Heikki Linnakangas
Subject: Re: Using quicksort and a merge step to significantly improve on tuplesort's single run "external sort"
Date:
Msg-id: 55B9CB7B.7070204@iki.fi
In reply to: Using quicksort and a merge step to significantly improve on tuplesort's single run "external sort" (Peter Geoghegan <pg@heroku.com>)
Responses: Re: Using quicksort and a merge step to significantly improve on tuplesort's single run "external sort" (Simon Riggs <simon@2ndQuadrant.com>)
Re: Using quicksort and a merge step to significantly improve on tuplesort's single run "external sort" (Peter Geoghegan <pg@heroku.com>)
List: pgsql-hackers
On 07/30/2015 04:05 AM, Peter Geoghegan wrote:
> Patch -- "quicksort with spillover"
> =========================
>
> With the attached patch, I propose to add an additional, better "one
> run special case" optimization. This new special case "splits" the
> single run into 2 "subruns". One of the runs is comprised of whatever
> tuples were in memory when the caller finished passing tuples to
> tuplesort. To sort that, we use quicksort, which in general has
> various properties that make it much faster than heapsort -- it's a
> cache-oblivious algorithm, which is important these days. The other
> "subrun" is whatever tuples were on-tape when tuplesort_performsort()
> was called. This will often be a minority of the total, but certainly
> never much more than half. This is already sorted when
> tuplesort_performsort() is reached. This spillover is already inserted
> at the front of the sorted-on-tape tuples, and so already has
> reasonably good cache characteristics.
>
> With the patch, we perform an on-the-fly merge that is somewhat
> similar to the existing (unaffected) "merge sort" TSS_FINALMERGE case,
> except that one of the runs is in memory, and is potentially much
> larger than the on-tape/disk run (but never much smaller), and is
> quicksorted. The existing "merge sort" case similarly is only used
> when the caller specified !randomAccess.

Hmm. You don't really need to merge the in-memory array into the tape, 
as you know that all the tuples in the in-memory array must come after 
tuples already on the tape. You can just return all the tuples from the 
tape first, and then all the tuples from the array.

So here's a shorter/different explanation of this optimization: When 
it's time to perform the sort, instead of draining the in-memory heap 
one tuple at a time to the last tape, you sort the heap with quicksort, 
and pretend that the sorted heap belongs to the last tape, after all the 
other tuples in the tape.
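
To illustrate what I mean, here's a toy sketch (not tuplesort code; the 
arrays and values are made up, and it assumes, as above, that everything 
still in memory sorts after everything already on the tape):

#include <stdio.h>
#include <stdlib.h>

static int
cmp_int(const void *a, const void *b)
{
	int		x = *(const int *) a;
	int		y = *(const int *) b;

	return (x > y) - (x < y);
}

int
main(void)
{
	/* Tuples already dumped to the tape, in sorted order. */
	int		tape[] = {1, 3, 5};
	/* The in-memory heap; everything here sorts after the tape's max. */
	int		mem[] = {9, 6, 8, 7};
	int		i;

	/* Quicksort the heap instead of draining it one tuple at a time. */
	qsort(mem, 4, sizeof(int), cmp_int);

	/* No merge needed: return the tape first, then the in-memory part. */
	for (i = 0; i < 3; i++)
		printf("%d\n", tape[i]);
	for (i = 0; i < 4; i++)
		printf("%d\n", mem[i]);
	return 0;
}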

Some questions/thoughts on that:

Isn't that optimization applicable even when you have multiple runs? 
Quicksorting the heap and keeping it as an array in memory is surely 
always faster than heapsorting and pushing it to the tape.

I think it'd make sense to structure the code differently, to match the 
way I described this optimization above. Instead of adding a new 
tuplesort state for this, abstract this in the logical tape code. Add a 
function to attach an in-memory "tail" to a tape, and have 
LogicalTapeRead() read from the tail after reading the on-disk tape. The 
rest of the code wouldn't need to care that sometimes part of the tape 
is actually in memory.
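
Roughly like this toy model (nothing like this exists in logtape.c 
today; the struct and function names are hypothetical, and the "disk" 
part is just an array standing in for block reads):

#include <stdio.h>

/*
 * Toy model of a tape whose final portion lives in memory.
 */
typedef struct ToyTape
{
	const int  *disk;			/* tuples already written to disk */
	int			ndisk;
	const int  *tail;			/* quicksorted in-memory tuples */
	int			ntail;
	int			pos;			/* current logical read position */
} ToyTape;

/*
 * Read the next tuple.  The caller can't tell where the on-disk part
 * ends and the in-memory tail begins.
 */
static int
toy_tape_read(ToyTape *t, int *out)
{
	if (t->pos < t->ndisk)
		*out = t->disk[t->pos];
	else if (t->pos < t->ndisk + t->ntail)
		*out = t->tail[t->pos - t->ndisk];
	else
		return 0;				/* end of tape */
	t->pos++;
	return 1;
}

int
main(void)
{
	int			disk[] = {1, 3, 5};
	int			tail[] = {6, 7, 8, 9};
	ToyTape		t = {disk, 3, tail, 4, 0};
	int			v;

	while (toy_tape_read(&t, &v))
		printf("%d\n", v);
	return 0;
}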

It should be pretty easy to support randomAccess too. If you think of 
the in-memory heap as a tail of the last tape, you can easily move 
backwards from the in-memory heap back to the on-disk tape, too.
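
Continuing the toy sketch above (still hypothetical), a backwards seek 
is just arithmetic on the same logical position:

static int
toy_tape_seek(ToyTape *t, int pos)
{
	if (pos < 0 || pos > t->ndisk + t->ntail)
		return 0;				/* out of range */
	/* In real code, positions within the on-disk part would seek on disk. */
	t->pos = pos;
	return 1;
}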

> +     * Note that there might actually be 2 runs, but only the
> +     * contents of one of them went to tape, and so we can
> +     * safely "pretend" that there is only 1 run (since we're
> +     * about to give up on the idea of the memtuples array being
> +     * a heap).  This means that if our sort happened to require
> +     * random access, the similar "single run" optimization
> +     * below (which sets TSS_SORTEDONTAPE) might not be used at
> +     * all.  This is because dumping all tuples out might have
> +     * forced an otherwise equivalent randomAccess case to
> +     * acknowledge a second run, which we can avoid.

Is that really true? We don't start a second run until we have to, i.e. 
when it's time to dump the first tuple of the second run to tape. So I 
don't think the case you describe above, where you have two runs but 
only one of them has tuples on disk, can actually happen.

> Performance
> ==========

Impressive!

> Predictability
> ==========

Even more impressive!

> Future work
> =========

As an extra optimization, you could delay quicksorting the in-memory 
array until it's time to read the first tuple from it. If the caller 
reads only the top-N tuples from the sort for some reason (other than 
LIMIT, which we already optimize for), that could avoid a lot of work.
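
Something along these lines (made-up names; in tuplesort this would live 
in the gettuple path):

#include <stdio.h>
#include <stdlib.h>

typedef struct LazyArray
{
	int		   *tuples;
	size_t		ntuples;
	int			sorted;			/* has the quicksort run yet? */
	size_t		readpos;
} LazyArray;

static int
cmp_int(const void *a, const void *b)
{
	int		x = *(const int *) a;
	int		y = *(const int *) b;

	return (x > y) - (x < y);
}

/*
 * Hand out the next tuple, sorting lazily: a caller that never reads
 * never pays for the quicksort.
 */
static int
lazy_gettuple(LazyArray *la, int *out)
{
	if (!la->sorted)
	{
		qsort(la->tuples, la->ntuples, sizeof(int), cmp_int);
		la->sorted = 1;
	}
	if (la->readpos >= la->ntuples)
		return 0;
	*out = la->tuples[la->readpos++];
	return 1;
}

int
main(void)
{
	int			data[] = {4, 2, 7, 1};
	LazyArray	la = {data, 4, 0, 0};
	int			v;

	while (lazy_gettuple(&la, &v))
		printf("%d\n", v);
	return 0;
}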

- Heikki