Re: Parallel copy

From: Heikki Linnakangas
Subject: Re: Parallel copy
Date:
Msg-id: 72363fde-b8ac-397f-67e1-ed5f74909cd8@iki.fi
In reply to: Re: Parallel copy  (Heikki Linnakangas <hlinnaka@iki.fi>)
List: pgsql-hackers
On 02/11/2020 09:10, Heikki Linnakangas wrote:
> On 02/11/2020 08:14, Amit Kapila wrote:
>> We have discussed both these approaches (a) single producer multiple
>> consumer, and (b) all workers doing the processing as you are saying
>> in the beginning and concluded that (a) is better, see some of the
>> relevant emails [1][2][3].
>>
>> [1] - https://www.postgresql.org/message-id/20200413201633.cki4nsptynq7blhg%40alap3.anarazel.de
>> [2] - https://www.postgresql.org/message-id/20200415181913.4gjqcnuzxfzbbzxa%40alap3.anarazel.de
>> [3] - https://www.postgresql.org/message-id/78C0107E-62F2-4F76-BFD8-34C73B716944%40anarazel.de
> 
> Sorry I'm late to the party. I don't think the design I proposed was
> discussed in those threads. The alternative discussed there seems to
> be something much more fine-grained, where processes claim individual
> lines. I'm not sure, though; I didn't fully understand the alternative
> designs.

I read the thread more carefully, and I think Robert had basically the 
right idea here 
(https://www.postgresql.org/message-id/CA%2BTgmoZMU4az9MmdJtg04pjRa0wmWQtmoMxttdxNrupYJNcR3w%40mail.gmail.com):

> I really think we don't want a single worker in charge of finding
> tuple boundaries for everybody. That adds a lot of unnecessary
> inter-process communication and synchronization. Each process should
> just get the next tuple starting after where the last one ended, and
> then advance the end pointer so that the next process can do the same
> thing. [...]

And here 
(https://www.postgresql.org/message-id/CA%2BTgmoZw%2BF3y%2BoaxEsHEZBxdL1x1KAJ7pRMNgCqX0WjmjGNLrA%40mail.gmail.com):

> On Thu, Apr 9, 2020 at 2:55 PM Andres Freund
> <andres(at)anarazel(dot)de> wrote:
>> I'm fairly certain that we do *not* want to distribute input data
>> between processes on a single tuple basis. Probably not even below
>> a few hundred kb. If there's any sort of natural clustering in the
>> loaded data - extremely common, think timestamps - splitting on a
>> granular basis will make indexing much more expensive. And have a
>> lot more contention.
> 
> That's a fair point. I think the solution ought to be that once any
> process starts finding line endings, it continues until it's grabbed
> at least a certain amount of data for itself. Then it stops and lets
> some other process grab a chunk of data.

Yes! That's pretty close to the design I sketched. I imagined that the 
leader would divide the input into 64 kB blocks, and each block would 
have a few metadata fields, notably the starting position of the first 
line in the block. I think Robert envisioned having a single "next 
starting position" field in shared memory. That works too, and is even 
simpler, so +1 for that.
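
To make the claiming step concrete, here's a rough standalone sketch in 
C, using a plain pthread mutex and an in-memory buffer in place of 
PostgreSQL shared memory, LWLocks, and the actual input reading. All of 
the names here (ChunkClaimer, claim_chunk, MIN_CHUNK_SIZE) are made up 
for illustration; none of this is from an actual patch.

    #include <pthread.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    #define MIN_CHUNK_SIZE (64 * 1024)  /* keep chunks coarse */

    typedef struct ChunkClaimer
    {
        pthread_mutex_t lock;
        const char *buf;        /* simplification: whole input in memory */
        size_t      len;
        size_t      next_start; /* the shared "next starting position" */
    } ChunkClaimer;

    /*
     * Claim the next chunk of input.  Starting at the shared position,
     * scan for line endings until at least MIN_CHUNK_SIZE bytes are
     * covered, then advance the shared position past the last newline
     * found.  Returns false when the input is exhausted.
     */
    static bool
    claim_chunk(ChunkClaimer *c, size_t *start, size_t *end)
    {
        bool        found = false;

        pthread_mutex_lock(&c->lock);
        if (c->next_start < c->len)
        {
            size_t      pos = c->next_start;
            size_t      target = pos + MIN_CHUNK_SIZE;

            while (pos < c->len)
            {
                const char *nl = memchr(c->buf + pos, '\n', c->len - pos);

                if (nl == NULL)
                {
                    pos = c->len;   /* final line has no newline */
                    break;
                }
                pos = (size_t) (nl - c->buf) + 1;
                if (pos >= target)
                    break;
            }

            *start = c->next_start;
            *end = pos;
            c->next_start = pos;    /* next worker continues from here */
            found = true;
        }
        pthread_mutex_unlock(&c->lock);
        return found;
    }

Because each claimed chunk covers at least 64 kB, the time spent holding 
the lock to find the closing newline should be small compared to the 
time spent parsing and inserting the chunk's rows, which is the point of 
keeping the chunks coarse.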

For some reason, the discussion took a different turn from there, 
towards how the line endings (called "chunks" in the discussion) should 
be represented in shared memory. But none of that is necessary with 
Robert's design: each worker finds the line endings within the chunk it 
has claimed, so they never need to be shared.
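
To spell out that last point, the per-worker loop on top of the 
claim_chunk() sketch above would look roughly like this; process_line() 
is a hypothetical stand-in for parsing one line and inserting the row.

    /* Hypothetical: parse one line and insert the resulting row. */
    extern void process_line(const char *line, size_t len);

    static void
    worker_main(ChunkClaimer *c)
    {
        size_t      start;
        size_t      end;

        while (claim_chunk(c, &start, &end))
        {
            const char *p = c->buf + start;
            const char *chunk_end = c->buf + end;

            /* Line boundaries inside the chunk stay private to this worker. */
            while (p < chunk_end)
            {
                const char *nl = memchr(p, '\n', (size_t) (chunk_end - p));
                size_t      linelen = nl ? (size_t) (nl - p)
                                         : (size_t) (chunk_end - p);

                process_line(p, linelen);
                p += linelen + 1;   /* step past the newline */
            }
        }
    }

The only shared state in this sketch is the single next_start position.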

- Heikki


