Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From: Rushabh Lathia
Subject: Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
Date:
Msg-id: CAGPqQf1rb+dvUtm-8JPzSTZCYSPiZbmW8NCRGxNnqkEviqbRTw@mail.gmail.com
In reply to: Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)  (Peter Geoghegan <pg@bowt.ie>)
Responses: Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)  (Robert Haas <robertmhaas@gmail.com>)
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)  (Peter Geoghegan <pg@bowt.ie>)
List: pgsql-hackers


On Wed, Sep 20, 2017 at 5:17 AM, Peter Geoghegan <pg@bowt.ie> wrote:
On Tue, Sep 19, 2017 at 3:21 AM, Rushabh Lathia
<rushabh.lathia@gmail.com> wrote:
> As per the earlier discussion in the thread, I experimented with the
> BufFileSet interface from the parallel-hash-v18 patchset.  I referred to
> the other parallel-hash patches to understand the BufFileSet APIs, and
> incorporated the changes into parallel CREATE INDEX.
>
> In order to achieve that:
>
> - Applied 0007-Remove-BufFile-s-isTemp-flag.patch and
> 0008-Add-BufFileSet-for-sharing-temporary-files-between-b.patch from the
> parallel-hash-v18 patchset.
> - Removed the buffile.c/logtape.c/fd.c changes from the parallel CREATE
> INDEX v10 patch.
> - Incorporated the BufFileSet API into the parallel tuple sort for CREATE
> INDEX.
> - Changed a few existing functions, and added a few new ones, to support
> the BufFileSet changes.

I'm glad that somebody is working on this. (Someone closer to the more
general work on shared/parallel BufFile infrastructure than I am.)

I do have some quick feedback, and I hope to be able to provide that
to both you and Thomas, as needed to see this one through. I'm not
going to get into the tricky details around resource management just
yet. I'll start with some simpler questions, to get a general sense of
the plan here.


Thanks, Peter.
 
I gather that you're at least aware that your v11 of the patch doesn't
preserve randomAccess support for parallel sorts, because you didn't
include my 0002-* testing GUCs patch, which was specifically designed
to make various randomAccess stuff testable. I also figured this to be
true because I noticed this FIXME among (otherwise unchanged)
tuplesort code:


Yes, I haven't touched the randomAccess part yet. My initial goal was
to incorporate the BufFileSet APIs here.
 
> +static void
> +leader_takeover_tapes(Tuplesortstate *state)
> +{
> +   Sharedsort *shared = state->shared;
> +   int         nLaunched = state->nLaunched;
> +   int         j;
> +
> +   Assert(LEADER(state));
> +   Assert(nLaunched >= 1);
> +   Assert(nLaunched == shared->workersFinished);
> +
> +   /*
> +    * Create the tapeset from worker tapes, including a leader-owned tape at
> +    * the end.  Parallel workers are far more expensive than logical tapes,
> +    * so the number of tapes allocated here should never be excessive. FIXME
> +    */
> +   inittapestate(state, nLaunched + 1);
> +   state->tapeset = LogicalTapeSetCreate(nLaunched + 1, shared->tapes,
> +                                         state->fileset, state->worker);

It's not surprising to me that you do not yet have this part working,
because much of my design was about changing as little as possible
above the BufFile interface, in order for tuplesort.c (and logtape.c)
code like this to "just work" as if it was the serial case.

Right. I just followed your design from your earlier patches.
 
It doesn't
look like you've added the kind of BufFile multiplexing code that I
expected to see in logtape.c. This is needed to compensate for the
code removed from fd.c and buffile.c. Perhaps it would help me to go
look at Thomas' latest parallel hash join patch -- did it gain some
kind of transparent multiplexing ability that you actually (want to)
use here?


Sorry, I didn't get this part. Are you talking about your patch's changes
to OpenTemporaryFileInTablespace(), BufFileUnify(), and other changes
related to ltsUnify()?  If that's the case, I don't think that is required
with the BufFileSet. Correct me if I am wrong here.

Though randomAccess isn't used by CREATE INDEX in general, and so not
supporting randomAccess within tuplesort.c for parallel callers
doesn't matter as far as this CREATE INDEX user-visible feature is
concerned, I still believe that randomAccess is important (IIRC,
Robert thought so too). Specifically, it seems like a good idea to
have randomAccess support, both on general principle (why should the
parallel case be different?), and because having it now will probably
enable future enhancements to logtape.c, enhancements that have it
manage parallel sorts based on partitioning/distribution/bucketing
[1]. I'm pretty sure that partitioning-based parallel sort is going to
become very important in the future, especially for parallel
GroupAggregate. The leader needs to truly own the tapes it reclaims
from workers in order for all of this to work.


The first application for the tuplesort here is CREATE INDEX, and that
doesn't need randomAccess. But as you said, and as has been discussed in
the thread, randomAccess is important and we should certainly put in the
effort to support it.

Questions on where you're going with randomAccess support:

1. Is randomAccess support a goal for you here at all?

2. If so, is preserving eager recycling of temp file space during
randomAccess (materializing a final output tape within the leader)
another goal for you here? Do we need to preserve that property of
serial external sorts, too, so that it remains true that logtape.c
ensures that "the total space usage is essentially just the actual
data volume, plus insignificant bookkeeping and start/stop overhead"?
(I'm quoting from master's logtape.c header comments.)

3. Any ideas on next steps in support of those 2 goals? What problems
do you foresee, if any?


To be frank, it's too early for me to comment on anything in this area.
I need to study this more closely. As an initial goal I was just focused
on understanding the current implementation of the patch and incorporating
the BufFileSet APIs.
 
> CREATE INDEX serial_idx ON parallel_sort_test (randint);
>
> Without patch:
>
> Time: 3430054.220 ms (57:10.054)
>
> With patch (max_parallel_workers_maintenance  = 8):
>
> Time: 1163445.271 ms (19:23.445)

This looks very similar to my v10. While I will need to follow up on
this, to make sure, it seems likely that this patch has exactly the
same performance characteristics as v10.


It's 2.96x, more or less similar to your v10.  There might be some
variation due to the different testing environment.
 
Thanks

[1] https://wiki.postgresql.org/wiki/Parallel_External_Sort#Partitioning_for_parallelism_.28parallel_external_sort_beyond_CREATE_INDEX.29
--
Peter Geoghegan



Thanks,
Rushabh Lathia

In the pgsql-hackers list, by date:

Previous
From: Ashutosh Bapat
Date:
Message: Re: [HACKERS] Varying results when using merge joins over postgres_fdw vs hash joins
Next
From: Ashutosh Sharma
Date:
Message: Re: [HACKERS] Page Scan Mode in Hash Index