Re: Using quicksort for every external sort run
From | Peter Geoghegan
Subject | Re: Using quicksort for every external sort run
Date |
Msg-id | CAM3SWZRFzg1LUK8FBg_goZ8zL0n7k6q83qQjhOV8NDZioA5TEQ@mail.gmail.com
In reply to | Re: Using quicksort for every external sort run (Peter Geoghegan <pg@heroku.com>)
Responses | Re: Using quicksort for every external sort run
List | pgsql-hackers
On Sun, Feb 14, 2016 at 8:01 PM, Peter Geoghegan <pg@heroku.com> wrote:
> The query I'm testing is: "reindex index pgbench_accounts_pkey;"
>
> Now, with a maintenance_work_mem of 5MB, the most recent revision of
> my patch takes about 54.2 seconds to complete this, as compared to
> master's 44.4 seconds. So, clearly a noticeable regression there of
> just under 20%. I did not see a regression with a 5MB
> maintenance_work_mem when pgbench scale was 100, though.

I've fixed this regression, and possibly all regressions where workMem > 4MB. I've done so without resorting to making the heap structure more complicated, or using a heap more often than when replacement_sort_mem is exceeded by work_mem or maintenance_work_mem (so replacement_sort_mem becomes something a bit different to what we discussed, Robert -- more on that later). This seems like an "everybody wins" situation, because in this revision the patch series is now appreciably *faster* where the amount of memory available is only a tiny fraction of the total input size.

Jeff Janes deserves a lot of credit for helping me to figure out how to do this. I couldn't get over his complaint about the regression he saw a few months back. He spoke of an "anti-sweetspot" in polyphase merge, and how he suspected that to be the real culprit (after all, most of his time was spent merging, with or without the patch applied). He also said that reverting the memory batch/pool patch made things go a bit faster, somewhat ameliorating his regression (when just the quicksort patch was applied). This made no sense to me, since I understood the memory batching patch to be orthogonal to the quicksort thing, capable of being applied independently, and more or less a strict improvement on master, no matter what the variables of the sort are.
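For readers following along: the core idea behind the batch/pool approach is to grab one large block for tuple copies and carve it up with cheap bump-pointer allocation, then recycle the whole thing at once at a run boundary, so per-tuple allocations can never fragment. Here's a minimal, hypothetical sketch of that pattern; `TupleArena` and its functions are invented illustration names, not tuplesort's actual API:

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for pooled tuple memory: one large block
 * obtained from the OS once, with bump-pointer sub-allocation and an
 * O(1) reset that recycles every tuple copy at a run boundary. */
typedef struct TupleArena
{
    char   *base;               /* single large allocation */
    size_t  size;               /* total capacity in bytes */
    size_t  used;               /* bump pointer offset */
} TupleArena;

static void
arena_init(TupleArena *a, size_t size)
{
    a->base = malloc(size);
    a->size = size;
    a->used = 0;
}

static void *
arena_alloc(TupleArena *a, size_t n)
{
    n = (n + 7) & ~(size_t) 7;  /* MAXALIGN-style rounding */
    if (a->used + n > a->size)
        return NULL;            /* caller must dump tuples, then reset */
    void *p = a->base + a->used;
    a->used += n;
    return p;
}

static void
arena_reset(TupleArena *a)
{
    a->used = 0;                /* all tuple copies recycled at once */
}
```

The payoff is that memory handed back at reset is immediately reusable at full size, whereas thousands of individually palloc()'d tuple copies of varying sizes can leave the allocator holding far more OS memory than the budget suggests.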
Jeff's regressed case especially made no sense to me (and, I gather, to him), given that the regression involved no correlation, and so clearly wasn't reliant on generating far fewer/longer runs than the patch (that's the issue we've discussed more than any other now -- it's a red herring, it seems). As I suspected out loud on February 14th, replacement selection mostly just *masked* the real problem: the problem of palloc() fragmentation. There doesn't seem to be much of an issue with the scheduling of polyphase merging, once you fix palloc() fragmentation. I've created a new revision, incorporating this new insight.

New Revision
============

Attached revision of patch series:

1. Creates a separate memory context for tuplesort's copies of caller's tuples, which can be reset at key points, avoiding fragmentation. Every SortTuple.tuple is allocated there (with trivial exception); *everything else*, including the memtuples array, is allocated in the existing tuplesort context, which becomes the parent of this new "caller's tuples" context. Roughly speaking, that means that about half of total memory for the sort is managed by each context in common cases.

Even with a high work_mem memory budget, memory fragmentation could previously get so bad that tuplesort would in effect claim a share of memory from the OS that is *significantly* higher than the work_mem budget allotted to its sort. And with low work_mem settings, fragmentation previously made palloc() thrash the sort, especially during non-final merging. In this latest revision, tuplesort now almost gets to use 100% of the memory that was requested from the OS by palloc() in the cases tested.

2. Loses the "quicksort with spillover" case entirely, making the quicksort patch significantly simpler. A *lot* of code was thrown out. This change is especially significant because it allowed me to remove the cost model that Robert took issue with so vocally.
"Quicksort with spillover" was always far less important than the basic idea of using quicksort for external sorts, so I'm not sad to see it go. And, I thought that the cost model was pretty bad myself. 3. Fixes cost_sort(), making optimizer account for the fact that runs are now about sort_mem-sized, not (sort_mem * 2)-sized. While I was at it, I made cost_sort() more optimistic about the amount of random I/O required relative to sequential I/O. This additional change to cost_sort() was probably overdue. 4. Restores the ability of replacement selection to generate one run and avoid any merging (previously, only one really long run and one short run was possible, because at the time I conceptualized replacement selection as being all about enabling "quicksort with spillover", which quicksorted that second run in memory). This only-one-run case is the case that Robert particularly cared about, and it's fully restored when RS is in use (which can still only happen for the first run, just never for the benefit of the now-axed "quicksort with spillover" case). 5. Adds a new GUC, "replacement_sort_mem". The default setting is 16MB. Docs are included. If work_mem/maintenance_work_mem is less than or equal to this, the first (and hopefully only) run uses replacement selection. "replacement_sort_mem" is a different thing to the concept for a GUC Robert and I discussed (only the name is the same). That other concept for a GUC related to the hybrid heap/quicksort idea (it controlled how big the heap portion of memtuples was supposed to be, in a speculative world where the patch took that "hybrid" approach [1] at all). In light of this new information about palloc() fragmentation, and given the risk to tuplesort's stability posed by implementing this "hybrid" algorithm, this seems like a good compromise. I cannot see an upside to pursuing the "hybrid" approach now. I regret reversing my position on that, but that's just how it happened. 
Since Robert was seemingly only worried about regressions, which are fixed now for a variety of cases that I tested, I'm optimistic that this will be acceptable to him. I believe that replacement_sort_mem as implemented here is quite useful, although mostly because I see some further opportunities for it.

Replacement Selection uses
--------------------------

What opportunities, you ask? Maybe CREATE INDEX can be made to accept a "presorted" parameter, letting the user promise that the input is more or less presorted. This allows tuplesort to use only a tiny heap, while having it throw an error if it cannot produce one long run (i.e. CREATE INDEX is documented as raising an error if the input is not more or less presorted).

The nice thing about that idea is that we can be very optimistic about the data actually being more or less presorted, so the implementation doesn't *actually* produce one long run -- it produces one long *index*, with IndexTuples passed back to nbtsort.c as soon as the heap fills for the first time, a bit like an on-the-fly merge. Unlike an on-the-fly merge, no tapes or temp files are actually involved; we write out IndexTuples by actually writing out the index optimistically. There is a significant saving by using a heap *because there is no need for a TSS_SORTEDONTAPE pass over the data*. We succeed at doing it all at once with a tiny heap, or die trying.

So, in a later version of Postgres (9.7?), replacement_sort_mem becomes more important because of this "presorted" CREATE INDEX parameter. That's a relatively easy patch to write, but it's not 9.6 material.

Commits
-------

Note that the attached revision makes the batch memory patch the first commit in the patch series. It might be useful to get that one out of the way first, since I imagine it is now considered the least controversial, and is perhaps the simplest of the two big patches in the series.
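Returning to the speculative "presorted" CREATE INDEX parameter described above: the "succeed or die trying" property is easy to model. A k-element min-heap fully sorts any input where no element is displaced by more than about k positions from its sorted spot, and it detects failure the moment the heap's minimum falls below the last value emitted. A hypothetical sketch (`heap_stream_sort` and its signature are invented for illustration, not a proposed interface):

```c
#include <stdlib.h>

static void
sift(int *h, int n, int i)      /* restore min-heap property downward */
{
    for (;;)
    {
        int l = 2 * i + 1, r = 2 * i + 2, m = i;
        if (l < n && h[l] < h[m]) m = l;
        if (r < n && h[r] < h[m]) m = r;
        if (m == i) return;
        int t = h[i]; h[i] = h[m]; h[m] = t;
        i = m;
    }
}

/* Stream input[0..n) through a k-element heap into out[0..n).
 * Return 0 on success, -1 if the input was not presorted enough to
 * emit everything as a single run (CREATE INDEX would raise an error). */
static int
heap_stream_sort(const int *input, int n, int k, int *out)
{
    int *h = malloc(sizeof(int) * k);
    int  hn = 0, outn = 0;

    for (int i = 0; i < n; i++)
    {
        if (hn < k)                     /* fill: insert and sift up */
        {
            int j = hn++;
            h[j] = input[i];
            while (j > 0 && h[j] < h[(j - 1) / 2])
            {
                int t = h[j]; h[j] = h[(j - 1) / 2]; h[(j - 1) / 2] = t;
                j = (j - 1) / 2;
            }
            continue;
        }
        if (outn > 0 && h[0] < out[outn - 1])
            goto fail;                  /* would need a second run */
        out[outn++] = h[0];             /* emit minimum, admit new value */
        h[0] = input[i];
        sift(h, hn, 0);
    }
    while (hn > 0)                      /* drain the heap */
    {
        if (outn > 0 && h[0] < out[outn - 1])
            goto fail;
        out[outn++] = h[0];
        h[0] = h[--hn];
        sift(h, hn, 0);
    }
    free(h);
    return 0;
fail:
    free(h);
    return -1;
}
```

With, say, k = 3, input whose elements are each at most a couple of positions out of place streams straight through, while badly unsorted input fails fast rather than silently degrading into a multi-run external sort.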
I'm not very optimistic about the memory prefetch patch 0003-* getting committed, but so far I've only seen it help, and all of my testing is based on having it applied. In any case, it's clearly way, way less important than the other two patches.

Testing
-------

N.B.: The debug patch, 0004-*, should not be applied during benchmarking.

I've used amcheck [2] to test this latest revision -- the tool ought not to see any problems with any index created with the patch applied. Reviewers might find it helpful to use amcheck, too. As 9.6 is stabilized, I anticipate that amcheck will give us a fighting chance at early detection of any bugs that might have slipped into tuplesort, or a B-Tree operator class. Since we still don't have even one single test of the external sort code [3], it's just as well.

If we wanted to test external sorting, maybe we'd do that by adding tests to amcheck that are not run by default, much like test_decoding, which tests logical decoding but is not targeted by "make installcheck"; that would allow the tests to be fairly comprehensive without being annoying. Using amcheck neatly side-steps issues with the portability of "expected" pg_regress output when collatable type sorting is tested.

Thoughts?

[1] http://www.postgresql.org/message-id/CA+TgmoY87y9FuZ=NE7JayH2emtovm9Jp9aLfFWunjF3utq4hfg@mail.gmail.com
[2] https://commitfest.postgresql.org/9/561/
[3] http://pgci.eisentraut.org/jenkins/job/postgresql_master_coverage/Coverage/src/backend/utils/sort/tuplesort.c.gcov.html

--
Peter Geoghegan
Attachments