Re: database sorting algorithms.

Поиск
Список
Период
Сортировка
От Francisco Olarte
Тема Re: database sorting algorithms.
Дата
Msg-id CA+bJJbwBOn9kDAnQv9pMrx3-DvG8Mxh+30h7rRBveN7vKTWtdw@mail.gmail.com
обсуждение исходный текст
Ответ на database sorting algorithms.  (Jian He <hejian.mark@gmail.com>)
Список pgsql-general
Jian He:

On Sat, May 1, 2021 at 9:07 AM Jian He <hejian.mark@gmail.com> wrote:
> Been self study Database, from database I deep dived into sorting algorithms.

Peek a good book, because that is a hairy topic with lot of math and
other previous knowledge required.

> Databases can do in-memory QuickSort. It also has an on-disk MergeSort.

If you ignore top-n and other stuff, just full sorts, databases uses
two types of sorting, internal sorts, which are the ones used when you
can fit your data on ram, and external sorts, when they have to spill
to tape/disk/whatever.

For internal sorts you can use quicksort, heapsort, mergesort, even
insertion sort. Quicksort with fallback to insertion is a popular one
( quicksort and merge sort are divide and conquer algorithms, but once
they have divided enough it is easier to fallback to a simpler
algorithm ).

When the data no longers fit in ram you need to make an external sort.
The classic way to do this is to sort chunks as big as they fit in
ram, write presorted chunks to disk and then merge those chunks. These
is called merge sort too, and is similar to a memory merge sort, but
not the same ( when I've done it I use two way merge for memory merge
sorts, but N-way for disk sort ( current memories are so big I never
need to do more than 1 merge pass since the tape era )

> For MergeSort: I follow this tutorial https://youtu.be/6pV2IF0fgKY?t=1108  (around 1 minutes only)
> But I am still not fully understanding about nlogn. I understand how many passes it will take, that is logn.
> Yes each pass will sort N elements.
> But I still don't get the N stand for in n*logn.

Basically you do logn passes over N elements.

> Why does each pass take  n time to sort it?

I have not time for looking at the video, but basic binary merge sort
is N log N because you have logN passes and each one must merge N
elements. Imagine you do an 8 elements pure merge sort.
1st you split it into 4+4 elements, recurse to sort them and merge
them.  Merging is an N-complexity op because you need to read the 8
elements.
4 elements are split into 2+2, same thing, merge neads four reads, but
you have two merges, so eight reads again.
2 elements are split into 1+1, two reads for merging, but you have
four chungk, so eight again.

Total, 3 passes ( log2(8) ) of eight reads.

Real sorts are more complex, but the N is always present, as you
always need to at least read the full input set to sort.

Note, in external sort you may have N elements, of which only M fit in ram.
You do C=N/M chunks , these need C*MlogM = N logMtime to be sorted
and C*M=N time to be written to and read from disk.
But to merge C chunks you basically do logC comparison for element, so
add N*logC N(logN-logM). If you add appropiate constants and add all
you'll find the final result is O(NlogN).

Francisco Olarte.








Francisco Olarte.



В списке pgsql-general по дате отправления:

Предыдущее
От: Jian He
Дата:
Сообщение: database sorting algorithms.
Следующее
От: Gavan Schneider
Дата:
Сообщение: Re: database sorting algorithms.