Re: Using quicksort for every external sort run

Поиск
Список
Период
Сортировка
От Peter Geoghegan
Тема Re: Using quicksort for every external sort run
Дата
Msg-id CAM3SWZQO7nY_UthWEH8JOEu3rSwRVyJOW3K+D9yjLXhvLq9TmA@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Using quicksort for every external sort run  (Peter Geoghegan <pg@heroku.com>)
Ответы Re: Using quicksort for every external sort run  (Robert Haas <robertmhaas@gmail.com>)
Re: Using quicksort for every external sort run  (Peter Geoghegan <pg@heroku.com>)
Список pgsql-hackers
On Thu, Feb 4, 2016 at 1:46 AM, Peter Geoghegan <pg@heroku.com> wrote:
> It wasn't my original insight that replacement selection has become
> all but obsolete. It took me a while to come around to that point of
> view.

Nyberg et al may have said it best in 1994, in the Alphasort Paper [1]:

"By comparison, OpenVMS sort uses a pure replacement-selection sort to
generate runs (Knuth, 1973). Replacement-selection is best for a
memory-constrained environment. On average, replacement-selection
generates runs that are twice as large as available memory, while the
QuickSort runs are typically less than half of available memory.
However, in a memory-rich environment, QuickSort is faster because it
is simpler, makes fewer exchanges on average, and has superior address
locality to exploit processor caching. "

(I believe that the authors state that "QuickSort runs are typically
less than half of available memory" because of the use of explicit
asynchronous I/O in each thread, which doesn't apply to us).

The paper also has very good analysis of the economics of sorting:

"Even for surprisingly large sorts, it is economical to perform the
sort in one pass."

Of course, memory capacities have scaled enormously in the 20 years
since this analysis was performed, so the analysis applies even at the
very low end these days. The high capacity memory system that they
advocate to get a one pass sort (instead of having faster disks) had
100MB of memory, which is of course tiny by contemporary standards. If
you pay Heroku $7 a month, you get a "Hobby Tier" database with 512MB
of memory. The smallest EC2 instance size, the t2.nano, costs about
$1.10 to run for one week, and has 0.5GB of memory.

The economics of using 4MB or even 20MB to sort 10GB of data are
already preposterously bad for everyone that runs a database server,
no matter how budget conscious they may be. I can reluctantly accept
that we need to still use a heap with very low work_mem settings to
avoid the risk of a regression (in the event of a strong correlation)
on general principle, but I'm well justified in proposing "just don't
do that" as the best practical advice.

I thought I had your agreement on that point, Robert; is that actually the case?

[1] http://www.cs.berkeley.edu/~rxin/db-papers/alphasort.pdf

-- 
Peter Geoghegan



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Rushabh Lathia
Дата:
Сообщение: Re: Optimization for updating foreign tables in Postgres FDW
Следующее
От: Andres Freund
Дата:
Сообщение: Re: 2016-01 Commitfest