Обсуждение: pgsql: Change the way pre-reading in external sort's merge phase works.

Поиск
Список
Период
Сортировка

pgsql: Change the way pre-reading in external sort's merge phase works.

От
Heikki Linnakangas
Дата:
Change the way pre-reading in external sort's merge phase works.

Don't pre-read tuples into SortTuple slots during merge. Instead, use the
memory for larger read buffers in logtape.c. We're doing the same number
of READTUP() calls either way, but managing the pre-read SortTuple slots
is much more complicated. Also, the on-tape representation is more compact
than SortTuples, so we can fit more pre-read tuples into the same amount
of memory this way. And we have better cache-locality, when we use just a
small number of SortTuple slots.

Now that we only hold one tuple from each tape in the SortTuple slots, we
can greatly simplify the "batch memory" management. We now maintain a
small set of fixed-sized slots, to hold the tuples, and fall back to
palloc() for larger tuples. We use this method during all merge phases,
not just the final merge, and also when randomAccess is requested, and
also in the TSS_SORTEDONTAPE case. In other words, it's used whenever we
do an external sort.

Reviewed by Peter Geoghegan and Claudio Freire.

Discussion: <CAM3SWZTpaORV=yQGVCG8Q4axcZ3MvF-05xe39ZvORdU9JcD6hQ@mail.gmail.com>

Branch
------
master

Details
-------
http://git.postgresql.org/pg/commitdiff/e94568ecc10f2638e542ae34f2990b821bbf90ac

Modified Files
--------------
src/backend/utils/sort/logtape.c   |  153 ++++-
src/backend/utils/sort/tuplesort.c | 1216 +++++++++++++-----------------------
src/include/utils/logtape.h        |    2 +
3 files changed, 574 insertions(+), 797 deletions(-)


Re: pgsql: Change the way pre-reading in external sort's merge phase works.

От
Peter Geoghegan
Дата:
On Mon, Oct 3, 2016 at 3:38 AM, Heikki Linnakangas
<heikki.linnakangas@iki.fi> wrote:
> Change the way pre-reading in external sort's merge phase works.

I noticed a simple oversight in this patch. It looks like you missed
one place where state->maxTapes ought to be replaced with
numInputTapes -- the loop that calls LogicalTapeAssignReadBufferSize()
needs that changed too, in order to continue to respect workMem as a
budget. There is a lesson for me, here: Start running tests of sorting
patches only after setting "sysctl -w vm.overcommit_memory=2", because
over commit can obscure things. Doing that as standard procedure for
testing would have allowed me to catch this immediately.

Maybe you should do the same with the other loop that iterates based
on a state->maxTapes invariant that was added to the end of
mergeruns() by this commit (use numInputTapes there in place of
state->maxTapes, too). And, I wonder if it's worth it to not actually
rewind in that loop at all -- perhaps you could instead call a new
logtape.c function that just frees preload buffer memory. You'd also
call this new simple "free preload buffer" function in place of the
LogicalTapeRewind() call within tuplesort_gettuple_common(), of
course.

I've found that I have to do this within the rebased parallel CREATE
INDEX patch anyway, since LogicalTapeRewind() has various irrelevant
pre and post conditions that don't interact well with tape unification
(so you get assertion failures that are probably harmless, but not
really fixable within LogicalTapeRewind() itself). Might be best to
get ahead of that now; my problem with using LogicalTapeRewind()
suggests to me that not using LogicalTapeRewind() to simply free
preload memory could be good *general* future-proofing.

Thanks
--
Peter Geoghegan


Re: pgsql: Change the way pre-reading in external sort's merge phase works.

От
Heikki Linnakangas
Дата:
On 10/06/2016 04:23 AM, Peter Geoghegan wrote:
> I noticed a simple oversight in this patch. It looks like you missed
> one place where state->maxTapes ought to be replaced with
> numInputTapes -- the loop that calls LogicalTapeAssignReadBufferSize()
> needs that changed too, in order to continue to respect workMem as a
> budget.

Nope. See the comment above that loop:

>     /*
>      * Set the buffers for the tapes.
>      *
>      * In a multi-phase merge, the tape that is initially used as an output
>      * tape, will later be rewound and read from, and should also use a large
>      * buffer at that point.  So we must loop up to maxTapes, not just
>      * numInputTapes!
>      *
>      * If there are fewer runs than tapes, we will set the buffer size also
>      * for tapes that will go completely unused, but that's harmless.
>      * LogicalTapeAssignReadBufferSize() doesn't allocate the buffer
>      * immediately, it just sets the size that will be used, when the tape is
>      * rewound for read, and the tape isn't empty.
>      */
>     for (tapenum = 0; tapenum < state->maxTapes; tapenum++)
>     {
>         int64        numBlocks = blocksPerTape + (tapenum < remainder ? 1 : 0);
>
>         LogicalTapeAssignReadBufferSize(state->tapeset, tapenum,
>                                         numBlocks * BLCKSZ);
>     }

- Heikki