Re: No merge sort?
От | Ron Peacetree |
---|---|
Тема | Re: No merge sort? |
Дата | |
Msg-id | FE_ka.16074$ey1.1385100@newsread1.prod.itd.earthlink.net обсуждение исходный текст |
Ответ на | Re: No merge sort? ("Dann Corbit" <DCorbit@connx.com>) |
Список | pgsql-hackers |
""Dann Corbit"" <DCorbit@connx.com> wrote in message news:D90A5A6C612A39408103E6ECDD77B829408AC6@voyager.corporate.connx.co m... > Distribution sort is not an algorithm. It is a general technique. Both > counting sort and radix sort are distribution sorts. I think you are > talking about counting sort here. In order to insert a technique into a > database, you must solve the general case. > Terminology problem. Please look in the references I posted. > > 2= For Radix sort, that's iff ("if and only if") you use > > =characters= as the radix of a radix sort > > and do a MSB aka partition-exchange sort. The appropriate > > radix here is a =3Dfield=3D not a character. Since there are 3 > > fields vs 60 characters the problem becomes 2/3 wasted passes > > instead of 40/60. > > This is lunacy. If you count the field or use it as a radix, you will > need a radix of 40*8 bits= 320 bits. That means you will have 2^320 > 2.136e96 bins. > You only need as many bins as you have unique key values silly ;-) Remember, at its core Radix sort is just a distribution counting sort (that's the name I learned for the general technique). The simplest implementation uses bits as the atomic unit, but there's nothing saying you have to... In a DB, we know all the values of the fields we currently have stored in the DB. We can take advantage of that. As you correctly implied, the goal is to minimize the number of unique bins you have to, err, distribute, items into. That and having as few duplicates as feasible are the important points. If, for example, we have <= 255 possible values for the each radix (Using your previous example, let's say you have <= 255 unique values for the combined field "Company+Division" in the DB and the same for "Plant" ), and 4 sets of radix to sort over, it doesn't matter if we are sorting a 32b quantity using a radix of 8b, or a 4 field quantity using a radix of one field (TBF, we should use key and index techniques to minimize the amount of actual data we retrieve and manipulate from disk during the sort. We want to minimize disk I/O, particularly seeking disk I/O). The situations are analogous. Note TANSTAAFL (as Heinlein would've said): We have to use extra space for mapping the radix values to the radix keys, and our "inner loop" for the sorting operation is considerably more complicated than that for a quicksort or a mergesort. Hence the fact that even though this is an O(n) technique, in real world terms you can expect only a 2-3X performance improvement over say quicksort for realistic amounts of data. Oh and FTR, IME for most "interesting" sorts in the DB domain, even for "internal" sorting techniques, the time to read data from disk and (possibly) write results back to disk tends to dominate the time spent actually doing the internal sort... I learned tricks like this 20 years ago. I thought this stuff was part of general lore in the DB field... *shrug*
В списке pgsql-hackers по дате отправления: