Re: Parallel tuplesort (for parallel B-Tree index creation)

From: Peter Geoghegan
Subject: Re: Parallel tuplesort (for parallel B-Tree index creation)
Date:
Msg-id: CAM3SWZS1F4vab6+QeEHLF3A1YHYEChoNULUVCokLixLi5QkzDA@mail.gmail.com
In reply to: Re: Parallel tuplesort (for parallel B-Tree index creation)  (Heikki Linnakangas <hlinnaka@iki.fi>)
Responses: Re: Parallel tuplesort (for parallel B-Tree index creation)  (Peter Geoghegan <pg@heroku.com>)
           Re: Parallel tuplesort (for parallel B-Tree index creation)  (Heikki Linnakangas <hlinnaka@iki.fi>)
List: pgsql-hackers
On Tue, Sep 6, 2016 at 12:34 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I'm reviewing patches 1-3 in this series, i.e. those patches that are not
> directly related to parallelism, but are independent improvements to
> merging.

That's fantastic! Thanks!

I'm really glad you're picking those up. I feel that I'm far too
dependent on Robert's review for this stuff. That shouldn't be taken
as a statement against Robert -- quite the opposite -- but it's just
personally difficult to rely on exactly one other person for
something that I've put so much work into. Robert has been involved
with every sorting patch I've written, generally with far less input
from anyone else, and at this point that's really rather a lot of
complex patches.

> Let's begin with patch 1:
>
> On 08/02/2016 01:18 AM, Peter Geoghegan wrote:
>>
>> Cap the number of tapes used by external sorts

> Yeah, wasting 8% of the memory budget on this seems like a bad idea. If I
> understand correctly, that makes the runs shorter than necessary, leading to
> more runs.

Right. Quite simply, whatever you could have used that workMem for
prior to the merge step, now you can't. It's not so bad during the
merge step of a final on-the-fly merge (or, with the 0002-* patch, any
final merge), since you can get a "refund" for unused tapes -- memory
that was logically allocated by USEMEM() but never really needed --
and use it to grow memtuples (although most of the refund actually
comes from other overhead). That still isn't much consolation to the
user, because run generation is typically much more expensive; we
really only refund unused tapes because it's easy to do.

>> A new quasi-arbitrary cap of 501 is applied on the number of tapes that
>> tuplesort will ever use (i.e.  merge order is capped at 500 inclusive).
>> This is a conservative estimate of the number of runs at which doing all
>> merging on-the-fly no longer allows greater overlapping of I/O and
>> computation.
>
>
> Hmm. Surely there are cases, so that with > 501 tapes you could do it with
> one merge pass, but now you need two? And that would hurt performance, no?

In theory, yes, that could be true, and not just for my proposed new
cap of 500 on merge order (501 tapes), but for any such cap. I
noticed that the Greenplum tuplesort.c uses a max of 250, so I just
thought I'd double that. Way back in 2006, Tom and Simon also talked
about a cap on several occasions, but I think the figure discussed
back then was in the thousands.
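
To be clear about what I mean, the change is basically just a clamp
in tuplesort_merge_order(), something like the sketch below; MAXORDER
is a placeholder name, and the existing formula and MINORDER clamp
are untouched:

static int
tuplesort_merge_order(int64 allowedMem)
{
    int         mOrder;

    /* one output tape plus mOrder input tapes, each needing buffer space */
    mOrder = (allowedMem - TAPE_BUFFER_OVERHEAD) /
        (MERGE_BUFFER_SIZE + TAPE_BUFFER_OVERHEAD);

    /* even in minimum memory, use at least a MINORDER merge */
    mOrder = Max(mOrder, MINORDER);
    /* new: never exceed MAXORDER (500), however large workMem is */
    mOrder = Min(mOrder, MAXORDER);

    return mOrder;
}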

Hundreds of runs are quite rare in practice. A second pass isn't that
painful anyway, because the merge process may be more CPU cache
efficient as a result, and CPU cost tends to be the dominant cost
these days, over and above the extra I/O that an extra pass requires.

This seems like a very familiar situation to me: I pick a
quasi-arbitrary limit or cap for something, and it's not clear that
it's optimal. Everyone more or less recognizes the need for such a
cap, but is uncomfortable about the exact figure chosen, not because
it's objectively bad, but because it's clearly something pulled from
the air, to some degree. It may not make you feel much better about
it, but I should point out that I've read a paper that claims "Modern
servers of the day have hundreds of GB operating memory and tens of TB
storage capacity. Hence, if the sorted data fit the persistent
storage, the first phase will generate hundreds of runs at most." [1].

Feel free to make a counter-proposal for a cap. I'm not attached to
500. I'm mostly worried about blatant waste with very large workMem
sizings. Tens of thousands of tapes is just crazy. The amount of data
that you need to have as input is very large when workMem is big
enough for this new cap to be enforced.
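
For some rough arithmetic -- if I have the current tuplesort.c
constants right (TAPE_BUFFER_OVERHEAD = 3 * BLCKSZ = 24kB,
MERGE_BUFFER_SIZE = 32 * BLCKSZ = 256kB, default 8kB BLCKSZ) -- merge
order only reaches 500 at around:

    workMem >= 500 * (256kB + 24kB) + 24kB  ~=  137MB

and with that much workMem you'd need, very roughly, 500
workMem-sized runs (tens of gigabytes of input) before the cap could
even in principle cost you an extra merge pass.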

> Why do we reserve the buffer space for all the tapes right at the beginning?
> Instead of the single USEMEM(maxTapes * TAPE_BUFFER_OVERHEAD) callin
> inittapes(), couldn't we call USEMEM(TAPE_BUFFER_OVERHEAD) every time we
> start a new run, until we reach maxTapes?

No, because then you have no way to claw that memory back later, and
by that point it's almost all in use (we hold off on making LACKMEM()
continually true if at all possible, which is almost always the
case). Paying for each new tape as it starts would mean continually
shrinking memtuples to make space for it, and that isn't really
workable.
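
To spell that out, the status quo is the single up-front charge in
inittapes(), roughly as sketched here; the per-run alternative has
nowhere to get the memory from once memtuples has been sized to fill
the budget:

/*
 * Sketch of the status quo in inittapes(): the entire tape buffer
 * budget is charged against workMem once, before any run is built.
 */
USEMEM(state, maxTapes * TAPE_BUFFER_OVERHEAD);

/*
 * The incremental alternative would have to do something like this
 * each time a new run starts -- but by then availMem is close to
 * zero and memtuples is already full, so the only way to pay for the
 * new tape would be to repeatedly shrink memtuples.
 */
if (state->currentRun < state->maxTapes - 1)
    USEMEM(state, TAPE_BUFFER_OVERHEAD);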

[1] http://ceur-ws.org/Vol-1343/paper8.pdf
-- 
Peter Geoghegan


