Discussion: Sort performance cliff with small work_mem
Hi,
I spent some time performance testing sorting, and spotted a funny
phenomenon with very small work_mem settings. This example demonstrates
it well:
I sorted about 1 GB worth of pre-sorted integers, with different
settings of work_mem, between 64 kB (the minimum) and 100 kB. I also
added some logging to print out the number of "runs" performed:
 work_mem   elapsed time      runs
 --------   ------------      ----
   64 kB       20167 ms     35301
   65 kB       19769 ms     34454
   66 kB       19844 ms     33646
   67 kB       20318 ms     32840
   68 kB       20022 ms     32105
   69 kB       19455 ms     31403
   70 kB       19509 ms     30700
   71 kB       19200 ms     30057
   72 kB       19248 ms     29441
   73 kB       19809 ms     29071
   74 kB       19840 ms     28657
   75 kB       19864 ms     28281
   76 kB       19634 ms     27914
   77 kB       19599 ms     27557
   78 kB       19429 ms     27184
   79 kB       19425 ms     26845
   80 kB       20196 ms     26515
   81 kB       21781 ms     26192
   82 kB       18883 ms     25877
   83 kB       20460 ms     25548
   84 kB       18910 ms     25249
   85 kB       29245 ms   2009692
   86 kB       26532 ms   1039496
   87 kB       25126 ms    701056
   88 kB       25241 ms    528867
   89 kB       24235 ms    418686
   90 kB       24038 ms    350528
   91 kB       24803 ms    301454
   92 kB       23192 ms    264434
   93 kB       23111 ms    233686
   94 kB       23295 ms    210807
   95 kB       26602 ms    192009
   96 kB       22990 ms    176289
   97 kB       22700 ms    162948
   98 kB       22370 ms    150727
   99 kB       22686 ms    140867
  100 kB       22413 ms    132217
Huh, something funny happened going from 84 kB to 85 kB. I traced it to
this piece of code in inittapes():
> /*
> * Decrease availMem to reflect the space needed for tape buffers; but
> * don't decrease it to the point that we have no room for tuples. (That
> * case is only likely to occur if sorting pass-by-value Datums; in all
> * other scenarios the memtuples[] array is unlikely to occupy more than
> * half of allowedMem. In the pass-by-value case it's not important to
> * account for tuple space, so we don't care if LACKMEM becomes
> * inaccurate.)
> */
> tapeSpace = (int64) maxTapes * TAPE_BUFFER_OVERHEAD;
>
> if (tapeSpace + GetMemoryChunkSpace(state->memtuples) < state->allowedMem)
> USEMEM(state, tapeSpace);
With small work_mem values, maxTapes is always 6, so tapeSpace is 48
kB. With a small enough work_mem, 84 kB or below in this test case,
there is not enough memory left at this point, so we don't subtract
tapeSpace. However, with a suitably evil test case, you can get
arbitrarily close to the edge, so that we subtract away almost all the
remaining memory, leaving only a few bytes for the tuples. In this
example case, with work_mem='85 kB', each quicksorted run consists of
only 15 tuples on average.
To fix, I propose that we change the above so that we always subtract
tapeSpace, but if that leaves less than e.g. 32 kB of memory (or even
makes availMem negative), we bump availMem back up to 32 kB. So we'd
always reserve 32 kB to hold the tuples, even if that means that we
exceed 'work_mem' slightly.
Memory accounting isn't very accurate in general. Even as the code
stands, we will exceed the requested work_mem if it's small enough, and
we don't account for every small allocation and overhead anyway. So I
think that would be acceptable, even though it's kind of cheating.
Of course, sorting large sets with tiny work_mem settings isn't a very
good idea to begin with. But I'd nevertheless like to avoid
non-intuitive performance cliffs like this. It threw me off while
performance testing another patch.
- Heikki
On Wed, May 2, 2018 at 11:38 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> To fix, I propose that we change the above so that we always subtract
> tapeSpace, but if there is less than e.g. 32 kB of memory left after that
> (including, if it went below 0), then we bump availMem back up to 32 kB. So
> we'd always reserve 32 kB to hold the tuples, even if that means that we
> exceed 'work_mem' slightly.

Sounds very reasonable.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, May 2, 2018 at 8:46 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, May 2, 2018 at 11:38 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> To fix, I propose that we change the above so that we always subtract
>> tapeSpace, but if there is less than e.g. 32 kB of memory left after that
>> (including, if it went below 0), then we bump availMem back up to 32 kB. So
>> we'd always reserve 32 kB to hold the tuples, even if that means that we
>> exceed 'work_mem' slightly.
>
> Sounds very reasonable.

+1

-- 
Peter Geoghegan
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, May 2, 2018 at 11:38 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> To fix, I propose that we change the above so that we always subtract
>> tapeSpace, but if there is less than e.g. 32 kB of memory left after that
>> (including, if it went below 0), then we bump availMem back up to 32 kB. So
>> we'd always reserve 32 kB to hold the tuples, even if that means that we
>> exceed 'work_mem' slightly.
> Sounds very reasonable.
Agreed. I think that was my code to start with, and the issue certainly
didn't occur to me at the time.
I don't like the idea of using hardwired "32kB" though; some multiple
of TAPE_BUFFER_OVERHEAD seems more plausible. Perhaps MINORDER times
TAPE_BUFFER_OVERHEAD would be good?
regards, tom lane
On Wed, May 2, 2018 at 8:38 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> With a small work_mem values, maxTapes is always 6, so tapeSpace is 48 kB.
> With a small enough work_mem, 84 kB or below in this test case, there is not
> enough memory left at this point, so we don't subtract tapeSpace. However,
> with a suitably evil test case, you can get arbitrary close to the edge, so
> that we will subtract away almost all the remaining memory above, leaving
> only a few bytes for the tuples. In this example case, with work_mem='85
> kB', each quicksorted run consists of only 15 tuples on average.

This is an extreme example of the memtuples array and memory for tuples
becoming unbalanced. You'll have 1024 memtuples (24KiB), which is rather
a lot more than the 15 tuples that you can fit in this example case.

I don't feel strongly about it, but arguably the minimum additional
amount of memory should be big enough that you have some chance of using
all 1024 SortTuples (the whole memtuples array).

-- 
Peter Geoghegan
On 02/05/18 19:41, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Wed, May 2, 2018 at 11:38 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>> To fix, I propose that we change the above so that we always subtract
>>> tapeSpace, but if there is less than e.g. 32 kB of memory left after that
>>> (including, if it went below 0), then we bump availMem back up to 32 kB. So
>>> we'd always reserve 32 kB to hold the tuples, even if that means that we
>>> exceed 'work_mem' slightly.
>
>> Sounds very reasonable.
>
> Agreed. I think that was my code to start with, and the issue certainly
> didn't occur to me at the time.
>
> I don't like the idea of using hardwired "32kB" though; some multiple
> of TAPE_BUFFER_OVERHEAD seems more plausible. Perhaps MINORDER times
> TAPE_BUFFER_OVERHEAD would be good?

I don't think the amount that we reserve to hold the tuples should
depend on those things. The function is "allocating" memory for the
tape buffers, yes, but now we're talking about keeping some memory for
the tuples, while quicksorting the initial runs. That seems independent
from the number of tapes or the tape buffer size.

I'm not sure what you could derive that from, to make it less
arbitrary. At the moment, I'm thinking of just doing something like
this:

/*
 * Minimum amount of memory reserved to hold the sorted tuples in
 * TSS_BUILDRUNS phase. This specifies a minimum size for the merge runs,
 * when work_mem is very small.
 */
#define MIN_TUPLE_MEMORY	(32 * 1024)

Independently of this, perhaps we should put in a special case in
dumptuples(), so that it would never create a run with fewer than
maxTapes tuples. The rationale is that you'll need to hold that many
tuples in memory during the merge phase anyway, so it seems silly to
bail out before that while building the initial runs. You're going to
exceed work_mem by roughly the same amount anyway, just in a different
phase. That's not the case in this example, but it might happen when
sorting extremely wide tuples.

- Heikki
On Wed, May 2, 2018 at 10:43 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Independently of this, perhaps we should put in special case in
> dumptuples(), so that it would never create a run with fewer than maxTapes
> tuples. The rationale is that you'll need to hold that many tuples in memory
> during the merge phase anyway, so it seems silly to bail out before that
> while building the initial runs. You're going to exceed work_mem by the
> roughly same amount anyway, just in a different phase. That's not the case
> in this example, but it might happen when sorting extremely wide tuples.

-1 from me. What about the case where only some tuples are massive?

-- 
Peter Geoghegan
On Wed, May 2, 2018 at 10:43 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I'm not sure what you could derive that from, to make it less arbitrary. At
> the moment, I'm thinking of just doing something like this:
>
> /*
>  * Minimum amount of memory reserved to hold the sorted tuples in
>  * TSS_BUILDRUNS phase. This specifies a minimum size for the merge runs,
>  * when work_mem is very small.
>  */
> #define MIN_TUPLE_MEMORY (32 * 1024)

If you end up doing something like this, I suggest that you also change
this code to simply assign 1024 (or maybe a new preprocessor constant):

state->memtupsize = Max(1024,
                        ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1);

The ALLOCSET_SEPARATE_THRESHOLD part can become a static assertion.

-- 
Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes:
> On Wed, May 2, 2018 at 10:43 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> Independently of this, perhaps we should put in special case in
>> dumptuples(), so that it would never create a run with fewer than maxTapes
>> tuples. The rationale is that you'll need to hold that many tuples in memory
>> during the merge phase anyway, so it seems silly to bail out before that
>> while building the initial runs. You're going to exceed work_mem by the
>> roughly same amount anyway, just in a different phase. That's not the case
>> in this example, but it might happen when sorting extremely wide tuples.
> -1 from me. What about the case where only some tuples are massive?
Well, what about it? If there are just a few wide tuples, then the peak
memory consumption will depend on how many of those happen to be in memory
at the same time ... but we have zero control over that in the merge
phase, so why sweat about it here? I think Heikki's got a good idea about
setting a lower bound on the number of tuples we'll hold in memory during
run creation.
regards, tom lane
On Wed, May 2, 2018 at 11:06 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> -1 from me. What about the case where only some tuples are massive?
>
> Well, what about it? If there are just a few wide tuples, then the peak
> memory consumption will depend on how many of those happen to be in memory
> at the same time ... but we have zero control over that in the merge
> phase, so why sweat about it here? I think Heikki's got a good idea about
> setting a lower bound on the number of tuples we'll hold in memory during
> run creation.

We don't have control over it, but I'm not excited about specifically
going out of our way to always use more memory in dumptuples() because
it's no worse than the worst case for merging.

I am supportive of the idea of making sure that the amount of memory
left over for tuples is reasonably in line with memtupsize at the point
that the sort starts, though.

-- 
Peter Geoghegan