Discussion: Timsort performance, quicksort (was: Re: Memory usage during sorting)
On 17 April 2012 13:19, Greg Stark <stark@mit.edu> wrote:
> All in all I think it's handier to have a stable ORDER BY sort than an
> unstable one though. So I'm not necessarily opposed to it even if it
> means we're stuck using a stable sort indefinitely.

I think it might be useful to discard the stability property if it's not a real requirement, as I think doing so gives us additional leeway to compose descending runs (pre-sorted subsets) that are not in what is termed "strictly descending" order (that is, they can be a[0] >= a[1] >= a[2] >= ... and not just a[0] > a[1] > a[2] > ...).

I'll share some preliminary findings on timsort performance (and, indeed, quicksort performance). I decided on two initial tests - one that tested the performance of sorting against master for a reasonable case, and the other for a rather sympathetic case.

The first test was a pgbench run of the query "select * from products order by actor offset 10001" against the dellstore database, pgbench -T 300. Here, actor is a text column. I didn't look at scalar types, because presumably quicksort is going to do much better there. After running analyze on the table, the pg_stats.correlation is -0.00493428. There is at best a very weak physical to logical correlation for the column, as far as the random sampling of analyze can tell, though there may well still be many individual "runs" of ordered subsets - I should be able to instrument timsort to determine how pre-sorted the data already is in a well-principled fashion, but not today.

master:
tps = 43.985745 (including connections establishing)
tps = 43.986028 (excluding connections establishing)

timsort:
tps = 39.181766 (including connections establishing)
tps = 39.182130 (excluding connections establishing)

Here, quicksort benefits somewhat from my earlier work (though there is no SortSupport entry in any of the tests performed), as a full tuplesort specialisation is used. timsort_arg is simply a drop-in replacement for qsort_arg.
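An aside on the stability/runs trade-off above: a stable timsort may only fold a descending stretch into a run when it is strictly descending, because reversing a run that contains equal elements would swap them and break stability; giving stability up lets >= chains qualify too, yielding fewer, longer runs on duplicate-heavy input. A toy run counter (a hypothetical Python sketch, not the patch's C code) makes the difference concrete:

```python
def count_runs(a, strict):
    """Count the natural runs a timsort-style scan would find.

    Ascending runs allow equal neighbours (a[i] <= a[i+1]) either way;
    descending runs must be strictly descending (>) when stability is
    required, but may use >= once stability has been discarded.
    """
    n = len(a)
    runs = 0
    i = 0
    while i < n:
        runs += 1
        j = i + 1
        if j < n and a[i] <= a[j]:        # ascending run
            while j < n and a[j - 1] <= a[j]:
                j += 1
        else:                             # descending run
            down = (lambda x, y: x > y) if strict else (lambda x, y: x >= y)
            while j < n and down(a[j - 1], a[j]):
                j += 1
        i = j                             # the next run starts here
    return runs

# With duplicates, dropping strictness merges runs:
# count_runs([3, 2, 2, 1], strict=True)  -> 2  ([3, 2] and [2, 1])
# count_runs([3, 2, 2, 1], strict=False) -> 1  (one loose descending run)
```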
It's fairly close, but quicksort does win out here, which is perhaps not hugely surprising. Timsort does of course have to maintain state like pending runs in a way that quicksort does not, and quicksort is expected to take advantage of memory locality to a greater degree. However, if we measure the expense of the sorts purely in terms of the number of comparisons, an altogether different picture emerges:

timsort: 119,943
quicksort: 136,071

I'd like to see what happens when timsort_arg is further specialised (not that I envisage multiple specialisations of it or anything - just a single tuplesort one).

I contrived a very sympathetic test-case too. Namely, I created a new numeric column for the dellstore database's orderlines table, with a default value of nextval('some_seq'), resulting in a pg_stats.correlation of 1. Then, I changed the actual value of the very last tuple in the table, with the intention of tripping up our quicksort "check for pre-sorted array in a single bubble sort iteration first" optimisation. Now, I'll grant you that that was a very carefully placed banana skin, but the results were still quite surprising and interesting.
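Comparison counts like those quoted here are cheap to gather without touching the sort itself: wrap the comparator so it tallies calls. A hypothetical Python sketch (Python's built-in sort is itself a timsort, which makes it handy for sanity-checking the n - 1 comparisons expected on pre-sorted input):

```python
import functools

def counting_comparator():
    """Return (counter, cmp) where cmp tallies every comparison made."""
    count = [0]
    def cmp(x, y):
        count[0] += 1
        return (x > y) - (x < y)
    return count, cmp

count, cmp = counting_comparator()
sorted(list(range(1000)), key=functools.cmp_to_key(cmp))
# Pre-sorted input is one long natural run: exactly 999 comparisons.
```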
With the query "select * from orderlines order by test offset 60351", a rather large gap in the performance of quicksort and timsort emerges:

Master:
=====
(first run)
number of transactions actually processed: 1891
tps = 6.301991 (including connections establishing)
tps = 6.302041 (excluding connections establishing)
(second run)
number of transactions actually processed: 1878
tps = 6.259945 (including connections establishing)
tps = 6.260839 (excluding connections establishing)

timsort:
=====
(first run)
number of transactions actually processed: 19997
tps = 66.655289 (including connections establishing)
tps = 66.655820 (excluding connections establishing)
(second run)
number of transactions actually processed: 19880
tps = 66.266234 (including connections establishing)
tps = 66.266810 (excluding connections establishing)

I can reproduce this result consistently - these two sets of figures were produced hours apart. Number of comparisons for a single execution of this more sympathetic query:

Timsort: 60,351
Quicksort: 2,247,314 (yes, really)

The difference here is very big, and cannot be adequately explained by the mere wastage of a single "bubble sort iteration to check if it's presorted". I have heard a few stories about weird quicksort edge cases, and you probably have too, but I don't know for sure which one this might be right now. My theory is that we're paying a high price for "going quadratic in the face of many duplicates" protection, by following Sedgewick's advice and doing a partition swap in response to the pivot and element being equal (which they pretty much always are here). This normally isn't so obvious, because the "check for pre-sorted input" thing usually takes care of it instead.

My next step is to try out a benchmark with a more realistic dataset that still reasonably showcases timsort's strengths.
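The cost of the banana skin itself is easy to model: the pre-sorted check is essentially a single pass over adjacent pairs that bails at the first inversion, so placing the lone out-of-order value at the very end makes the check inspect nearly the whole array before failing. A toy model (hypothetical Python, not the actual gen_qsort_tuple.pl output):

```python
def presorted_check(a, counter):
    """One bubble-sort-style pass: True iff a is already sorted,
    charging one comparison per adjacent pair inspected."""
    for i in range(len(a) - 1):
        counter[0] += 1
        if a[i] > a[i + 1]:
            return False          # bail at the first inversion
    return True

n = 60352
a = list(range(n))
a[-1] = -1                        # the carefully placed banana skin
cost = [0]
presorted_check(a, cost)
# The scan only fails on the final pair: n - 1 = 60,351 comparisons,
# all wasted - and a recursive quicksort repeats the check per partition.
```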
As Tim Peters himself put it in a document that describes the algorithm, "in a database supplied by Kevin Altis, a sort on its highly skewed "on which stock exchange does this company's stock trade?" field ran over twice as fast under timsort", so I'll try and find something along those lines that is relatively easy to recreate, and relatively realistic. Certainly, highly skewed data is not at all uncommon.

Just for the record, I don't have any strong opinions on:

1. What we should be doing with timsort, if anything. It is one thing to demonstrate that it's a useful algorithm under certain artificial conditions, but quite another to argue for its inclusion in Postgres, or for it being selectively used at points where that is likely to be a win, based on some criterion or another like known cardinality, physical/logical correlation or assumed costs of comparisons for each type. At the very least, it is an interesting algorithm, but without integration that makes it actually serve user needs, that's all it will ever be to us. Deciding if and when it should be used is a rather nuanced process, and I'm certainly not about to declare that we should get rid of quicksort. It does appear to be a fairly good fit for some of our requirements, though.

2. Whether we should attempt to patch up quicksort to prevent this problem. I lean towards no, since the cure may well be worse than the disease.

Thoughts?

--
Peter Geoghegan
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
Peter Geoghegan <peter@2ndquadrant.com> writes:
> 1. What we should be doing with timsort, if anything. It is one
> thing to demonstrate that it's a useful algorithm under certain
> artificial conditions, but quite another to argue for its inclusion in
> Postgres, or for it being selectively used at points where that is
> likely to be a win, based on some criterion or another like known
> cardinality, physical/logical correlation or assumed costs of
> comparisons for each type. At the very least, it is an interesting
> algorithm, but without integration that makes it actually serve user
> needs, that's all it will ever be to us. Deciding if and when it
> should be used is a rather nuanced process, and I'm certainly not
> about to declare that we should get rid of quicksort. It does appear
> to be a fairly good fit for some of our requirements though.

I kind of understood timsort would shine in sorting text in a non-C collation, because of the comparison cost. So a test in some UTF8 collation or other would be interesting, right?

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr
PostgreSQL : Expertise, Formation et Support
On 19 April 2012 19:24, Dimitri Fontaine <dimitri@2ndquadrant.fr> wrote:
> I kind of understood timsort would shine in sorting text in non-C
> collation, because of the comparison cost. So a test in some UTF8
> collation or other would be interesting, right?

That's certainly the theory, yes. In practice, even though timsort lives up to its promise of significantly reducing the number of comparisons required in many common situations, my implementation does not actually perform better than qsort_arg. Even a reduction of over 10% in the number of comparisons does not result in a net performance gain.

It wouldn't surprise me if the implementation used is quite suboptimal, and it might well be worth profiling and optimising. It doesn't appear to be the big win that I'd hoped for, though. The assumed cost of a comparison has to be stretched rather a lot further than the very common case of sorting a single key of non-C collated text for it to be worth it, and that's just too thin for me to sink more time into this right now.

--
Peter Geoghegan
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
On Wed, Apr 18, 2012 at 9:31 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> Thoughts?

Interesting work. I thought about trying to code up timsort at one point, but I've been running short of round tuits.

I did some quick tests of quicksort using half a million random strings. On my MacBook Pro, if the strings are in random order, quicksort takes about 12 seconds. If they are presorted, it takes about 800 ms. If they are in sorted order with an empty string appended onto the end, it takes about 25 seconds.

If I modify gen_qsort_tuple.pl to perform the "presorted" input check only when n > 40, the time for the presorted-except-the-last-element-is-small test drops from 25 seconds to about 21.5 seconds, without apparently harming either of the other two cases. If I remove the check altogether, the time further drops to about 13.5 seconds, the time to sort completely ordered data rises to about 10-10.5 seconds, and the time to sort randomly ordered data still doesn't seem to change much.

Based on that, I'm inclined to propose rejiggering things so that the presorted-input check runs only at the top level, and not during any recursive steps. The theory is that it won't cost much because if the data is randomly ordered we won't do many comparisons before falling out, and that seems to be true. But the only point of doing it in the recursive steps is to win when the data is partially ordered, and we actually seem to be losing big-time in that case, perhaps because when the data is partially ordered, the presort check will frequently run through a significant part of the array - wasting cycles - but fall out before it actually gets to the end.

Of course, even if we did that, it might not be as good as your timsort numbers, but that doesn't mean we shouldn't do it...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
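The rejiggering proposed above can be sketched in a few lines (hypothetical Python; list.sort stands in for the recursive quicksort that gen_qsort_tuple.pl would actually emit):

```python
def sort_with_toplevel_check(a):
    """Run the pre-sorted check exactly once, before any partitioning;
    the recursive sort itself never repeats it."""
    if all(a[i] <= a[i + 1] for i in range(len(a) - 1)):
        return a                  # already sorted: one O(n) scan, done
    a.sort()                      # stand-in for the recursive quicksort
    return a
```

The win comes from the failure mode: on partially ordered input, the scan runs once and fails once, instead of failing part-way through at every recursion level.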
On Tue, Apr 24, 2012 at 4:17 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Based on that, I'm inclined to propose rejiggering things so that the
> presorted-input check runs only at the top level, and not during any
> recursive steps.

Just a thought: what about running it only every nth step - maybe something like every 4th step?

But actually I'm confused. This seems misguided to me. Quicksort isn't stable, so even if you have a partially sorted data set, the recursive partitions are going to be at best partially sorted after a pivot. I haven't walked through it, but I suspect even your all-but-one-sorted data set is not finding the data sorted in either partition on the next iteration.

--
greg
On 24 April 2012 16:17, Robert Haas <robertmhaas@gmail.com> wrote:
> If they are in sorted order with an empty string
> appended onto the end, it takes about 25 seconds.

That's exactly what I'd have expected, but was surprised not to have found with my own test. Perhaps it was the same kind of fluke (i.e. a re-creatable one - I'm quite confident that my benchmark was not methodologically flawed, at least in execution).

> Based on that, I'm inclined to propose rejiggering things so that the
> presorted-input check runs only at the top level, and not during any
> recursive steps. The theory is that it won't cost much because if the
> data is randomly ordered we won't do many comparisons before falling
> out, and that seems to be true. But the only point of doing it in the
> recursive steps is to win when the data is partially ordered, and we
> actually seem to be losing big-time in that case, perhaps because when
> the data is partially ordered, the presort check will frequently run
> through a significant part of the array - wasting cycles - but
> fall out before it actually gets to the end.

That makes sense to me, but obviously more data is needed here.

> Of course, even if we did that, it might not be as good as your
> timsort numbers, but that doesn't mean we shouldn't do it...

The frustrating thing about my timsort numbers, as I mentioned in reply to Dimitri (he modified the subject a bit, so that might appear to be a different thread to you), is that they appear to be almost consistently winning when you consider the number of comparisons, but in fact lose when you measure the duration of execution or TPS or whatever. I expected a certain amount of this, but not enough to entirely derail the case for replacing quicksort with timsort when sorting a single key of text, which is obviously the really compelling case for optimisation here.
This situation is only going to be made "worse" by the work you've done on SortSupport for text, which, incidentally, I agree is worthwhile.

--
Peter Geoghegan
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
On Tue, Apr 24, 2012 at 12:30 PM, Greg Stark <stark@mit.edu> wrote:
> On Tue, Apr 24, 2012 at 4:17 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Based on that, I'm inclined to propose rejiggering things so that the
>> presorted-input check runs only at the top level, and not during any
>> recursive steps.
>
> Just a thought. What about running only every nth step. Maybe
> something like every 4th step.

If there's actually some advantage to that, sure. But so far, it looks like less is more.

> But actually I'm confused. This seems to be misguided to me. Quicksort
> isn't stable so even if you have a partially sorted data set the
> recursive partitions are going to be at best partially sorted after a
> pivot.

Exactly. That's why I think we should do it only at the topmost level, before we've done any pivoting. Doing it at any lower level is apparently a waste of energy and counterproductive.

> I haven't walked through it but suspect even your
> all-but-one-sorted data set is not finding
> the data sorted in either partition on the next iteration.

I suspect that, too. Actually, I'm a bit confused about why it's picking such terrible pivots. Our median-of-medians optimization should be doing better than this, I would think.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
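For reference, the pivot selection under discussion follows the Bentley & McIlroy scheme that PostgreSQL's qsort inherits: the middle element for tiny arrays, a median of three for mid-sized ones, and a pseudo-median of nine (the "ninther") for large ones. A hypothetical Python sketch of that logic (thresholds as in Bentley & McIlroy; illustrative only, not the pg_qsort source):

```python
def med3(a, i, j, k):
    """Index of the median of a[i], a[j], a[k]."""
    if a[i] < a[j]:
        if a[j] < a[k]:
            return j
        return k if a[i] < a[k] else i
    else:
        if a[j] > a[k]:
            return j
        return k if a[i] > a[k] else i

def choose_pivot_index(a):
    n = len(a)
    lo, mid, hi = 0, n // 2, n - 1
    if n <= 7:
        return mid                      # tiny: middle element
    if n <= 40:
        return med3(a, lo, mid, hi)     # mid-sized: median of three
    d = n // 8                          # large: median of three medians
    return med3(a,
                med3(a, lo, lo + d, lo + 2 * d),
                med3(a, mid - d, mid, mid + d),
                med3(a, hi - 2 * d, hi - d, hi))
```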
On Tue, Apr 24, 2012 at 12:51 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> On 24 April 2012 16:17, Robert Haas <robertmhaas@gmail.com> wrote:
>> If they are in sorted order with an empty string
>> appended onto the end, it takes about 25 seconds.
>
> That's exactly what I'd have expected, but was surprised not to have
> found with my own test. Perhaps it was the same kind of fluke (i.e. a
> re-creatable one - I'm quite confident that my benchmark was not
> methodologically flawed, at least in execution).

Oh, I read your results as showing something quite similar.

>> Based on that, I'm inclined to propose rejiggering things so that the
>> presorted-input check runs only at the top level, and not during any
>> recursive steps. The theory is that it won't cost much because if the
>> data is randomly ordered we won't do many comparisons before falling
>> out, and that seems to be true. But the only point of doing it in the
>> recursive steps is to win when the data is partially ordered, and we
>> actually seem to be losing big-time in that case, perhaps because when
>> the data is partially ordered, the presort check will frequently run
>> through a significant part of the array - wasting cycles - but
>> fall out before it actually gets to the end.
>
> That makes sense to me, but obviously more data is needed here.

What more data do you think is needed? I've been suspicious of that code since the first time I looked at it, and I'm now fairly well convinced that it's full of suckitude. Honestly, I'm not sure I could manage to contrive a case where that code wins if I set out to do so.

>> Of course, even if we did that, it might not be as good as your
>> timsort numbers, but that doesn't mean we shouldn't do it...
> The frustrating thing about my timsort numbers, as I mentioned in
> reply to Dimitri (he modified the subject a bit, so that might appear
> to be a different thread to you), is that they appear to be almost
> consistently winning when you consider the number of comparisons, but
> in fact lose when you measure the duration of execution or TPS or
> whatever. I expected a certain amount of this, but not enough to
> entirely derail the case for replacing quicksort with timsort when
> sorting a single key of text, which is obviously the really compelling
> case for optimisation here. This situation is only going to be made
> "worse" by the work you've done on SortSupport for text, ...

That is quite baffling. Have you profiled it at all?

> ...which,
> incidentally, I agree is worthwhile.

Cool, thanks.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 24 April 2012 22:01, Robert Haas <robertmhaas@gmail.com> wrote:
>> That's exactly what I'd have expected, but was surprised not to have
>> found with my own test. Perhaps it was the same kind of fluke (i.e. a
>> re-creatable one - I'm quite confident that my benchmark was not
>> methodologically flawed, at least in execution).
>
> Oh, I read your results as showing something quite similar.

Sorry, I think my quick reading of your e-mail somehow left me with the impression that the differences were not so pronounced for your experiment, but indeed they are. Blame it on the fact that I've been off work for a week, I suppose.

>> That makes sense to me, but obviously more data is needed here.
>
> What more data do you think is needed? I've been suspicious of that
> code since the first time I looked at it, and I'm now fairly well
> convinced that it's full of suckitude. Honestly, I'm not sure I could
> manage to contrive a case where that code wins if I set out to do so.

Yeah, I thought that the rationale for introducing the pre-sorted check, as it appeared in a commit message, was a little weak. I don't know that I'd go as far as to say that it was full of suckitude. The worst thing about that code, to my mind, is that despite being highly performance-critical, it has exactly no comments beyond a brief description, and the names of its variables are arcanely curt.

>>> Of course, even if we did that, it might not be as good as your
>>> timsort numbers, but that doesn't mean we shouldn't do it...
>>
>> The frustrating thing about my timsort numbers, as I mentioned in
>> reply to Dimitri (he modified the subject a bit, so that might appear
>> to be a different thread to you), is that they appear to be almost
>> consistently winning when you consider the number of comparisons, but
>> in fact lose when you measure the duration of execution or TPS or
>> whatever.
>> I expected a certain amount of this, but not enough to
>> entirely derail the case for replacing quicksort with timsort when
>> sorting a single key of text, which is obviously the really compelling
>> case for optimisation here. This situation is only going to be made
>> "worse" by the work you've done on SortSupport for text, ...
>
> That is quite baffling. Have you profiled it at all?

Yes, I have profiled it, but didn't get much further than actually producing the figures. gprof output for a backend that had a few of these queries executed against it is attached, FWIW. I'm not sure that the information is really very actionable, and I'm aware that oprofile is preferred here (besides the known deficiencies of gprof, oprofile's annotated source feature could be particularly useful here), which is something I might get around to.

Just before I decided that this might not be the best way to spend my time off, but after profiling, I got as far as doing this:

https://github.com/Peter2ndQuadrant/postgres/commit/cbbbd7b1d4ef82773eb272ee35a483cbf6678756

Experimentally, this appeared to make quite a big difference (at least for the limited cases that I tried), but not so much as to alter the practical implications of the outcome. It's not as if I've gone to any great lengths to try and optimise the code yet, though.

CPython puts the value of MIN_MERGE at 64, whereas for Java it's 32. A comment within the Java implementation for this value reads:

 * This is the minimum sized sequence that will be merged. Shorter
 * sequences will be lengthened by calling binarySort. If the entire
 * array is less than this length, no merges will be performed.
 *
 * This constant should be a power of two. It was 64 in Tim Peter's C
 * implementation, but 32 was empirically determined to work better in
 * this implementation. In the unlikely event that you set this constant
 * to be a number that's not a power of two, you'll need to change the ...
 * See listsort.txt for a discussion
 * of the minimum stack length required as a function of the length
 * of the array being sorted and the minimum merge sequence length.

I did not determine why that is, nor do I have any theories as to why the even lower value of 8 (and consequently, a greater tendency towards merging rather than performing insertion sorts) was a win for me.

--
Peter Geoghegan
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
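For context on how MIN_MERGE is used: timsort derives its minimum run length from it by taking the six most significant bits of n and adding one if any of the remaining bits are set, as described in listsort.txt. A Python transcription of that computation (using CPython's MIN_MERGE of 64):

```python
MIN_MERGE = 64   # CPython's value; Java's is 32

def minrun(n):
    """Per listsort.txt: return n itself when n < MIN_MERGE; otherwise
    a value in [MIN_MERGE // 2, MIN_MERGE] such that n / minrun is a
    power of 2 or, failing that, slightly less than a power of 2."""
    r = 0                        # becomes 1 if any shifted-out bit is set
    while n >= MIN_MERGE:
        r |= n & 1
        n >>= 1
    return n + r

# minrun(2112) == 33, so a 2112-element array splits into exactly
# 64 runs of length 33 - a balanced merge pattern, per listsort.txt.
```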
Attachments
On Tue, Apr 24, 2012 at 7:16 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
>>> That makes sense to me, but obviously more data is needed here.
>>
>> What more data do you think is needed? I've been suspicious of that
>> code since the first time I looked at it, and I'm now fairly well
>> convinced that it's full of suckitude. Honestly, I'm not sure I could
>> manage to contrive a case where that code wins if I set out to do so.
>
> Yeah, I thought that the rationale for introducing the pre-sorted
> check, as it appeared in a commit message, was a little weak. I don't
> know that I'd go as far as to say that it was full of suckitude. The
> worst thing about that code to my mind is that despite being highly
> performance critical, it has exactly no comments beyond a brief
> description, and the names of variables are arcanely curt.

Well, what I don't like about it is that it doesn't work. Having a special case for pre-sorted input makes sense to me, but doing that check recursively at every level is pointless unless it helps with almost-sorted input, and doubling the runtime doesn't meet my definition of helping.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company