Re: Tuplesort merge pre-reading

Поиск
Список
Период
Сортировка
От Peter Geoghegan
Тема Re: Tuplesort merge pre-reading
Дата
Msg-id CAM3SWZTpaem_gQLcjWreM+NGQZz=_H=KVKn3Yr0dJ+0XZiDNuw@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Tuplesort merge pre-reading  (Heikki Linnakangas <hlinnaka@iki.fi>)
Ответы Re: Tuplesort merge pre-reading  (Robert Haas <robertmhaas@gmail.com>)
Re: Tuplesort merge pre-reading  (Heikki Linnakangas <hlinnaka@iki.fi>)
Список pgsql-hackers
On Thu, Sep 29, 2016 at 10:49 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> Do I have that right? If so, this seems rather awkward. Hmm.
>
>
> How is it awkward?

Maybe that was the wrong choice of words. What I mean is that it seems
somewhat unprincipled to give over an equal share of memory to each
active-at-least-once tape, regardless of how much that matters in
practice. One tape could have several runs in memory at once, while
another only has a fraction of a single much larger run. Maybe this is
just the first time that I find myself on the *other* side of a
discussion about an algorithm that seems brute-force compared to what
it might replace, but is actually better overall.   :-)

Now, that could be something that I just need to get over. In any
case, I still think:

* Variables like maxTapes have a meaning that is directly traceable
back to Knuth's description of polyphase merge. I don't think that you
should do anything to them, on general principle.

* Everything or almost everything that you've added to mergeruns()
should probably be in its own dedicated function. This function can
have a comment where you acknowledge that it's not perfectly okay that
you claim memory per-tape, but it's simpler and faster overall.

* I think you should be looking at the number of active tapes, and not
state->Level or state->currentRun in this new function. Actually,
maybe this wouldn't be the exact definition of an active tape that we
establish at the beginning of beginmerge() (this considers tapes with
dummy runs to be inactive for that merge), but it would look similar.
The general concern I have here is that you shouldn't rely on
state->Level or state->currentRun from a distance for the purposes of
determining which tapes need some batch memory. This is also where you
might say something like: "we don't bother with shifting memory around
tapes for each merge step, even though that would be fairer. That's
why we don't use the beginmerge() definition of activeTapes --
instead, we use our own broader definition of the number of active
tapes that doesn't exclude dummy-run-tapes with real runs, making the
allocation reasonably sensible for every merge pass".

* The "batchUsed" terminology really isn't working here, AFAICT. For
one thing, you have two separate areas where caller tuples might
reside: The small per-tape buffers (sized MERGETUPLEBUFFER_SIZE per
tape), and the logtape.c controlled preread buffers (sized up to
MaxAllocSize per tape). Which of these two things is batch memory? I
think it might just be the first one, but KiBs of memory do not
suggest "batch" to me. Isn't that really more like what you might call
double buffer memory, used not to save overhead from palloc (having
many palloc headers in memory), but rather to recycle memory
efficiently? So, these two things should have two new names of their
own, I think, and neither should be called "batch memory" IMV. I see
several assertions remain here and there that were written with my
original definition of batch memory in mind -- assertions like:
       case TSS_SORTEDONTAPE:           Assert(forward || state->randomAccess);           Assert(!state->batchUsed);

(Isn't state->batchUsed always true now?)

And:
       case TSS_FINALMERGE:           Assert(forward);           Assert(state->batchUsed || !state->tuples);

(Isn't state->tuples only really of interest to datum-tuple-case
routines, now that you've made everything use large logtape.c preread
buffers?)

* Is is really necessary for !state->tuples cases to do that
MERGETUPLEBUFFER_SIZE-based allocation? In other words, what need is
there for pass-by-value datum cases to have this scratch/double buffer
memory at all, since the value is returned to caller by-value, not
by-reference? This is related to the problem of it not being entirely
clear what batch memory now is, I think.

-- 
Peter Geoghegan



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Magnus Hagander
Дата:
Сообщение: Re: pg_basebackup stream xlog to tar
Следующее
От: David Steele
Дата:
Сообщение: Re: PATCH: Exclude additional directories in pg_basebackup