Re: Parallel copy

Поиск

Список

Период

Сортировка

От	Ashutosh Sharma
Тема	Re: Parallel copy
Дата	15 июня 2020 г. 17:11:03
Msg-id	CAE9k0Pnf+fdcMa1Geq-OQTeJ3sHrWfDJ3cG4nGoPcbahmqNk8A@mail.gmail.com обсуждение исходный текст
Ответ на	Re: Parallel copy (Amit Kapila <amit.kapila16@gmail.com>)
Ответы	Re: Parallel copy (Amit Kapila <amit.kapila16@gmail.com>)
Список	pgsql-hackers

Дерево обсуждения

Thanks Amit for the clarifications. Regarding partitioned table, one of the question was - if we are loading data into a partitioned table using COPY command, then the input file would contain tuples for different tables (partitions) unlike the normal table case where all the tuples in the input file would belong to the same table. So, in such a case, how are we going to accumulate tuples into the DSM? I mean will the leader process check which tuple needs to be routed to which partition and accordingly accumulate them into the DSM. For e.g. let's say in the input data file we have 10 tuples where the 1st tuple belongs to partition1, 2nd belongs to partition2 and likewise. So, in such cases, will the leader process accumulate all the tuples belonging to partition1 into one DSM and tuples belonging to partition2 into some other DSM and assign them to the worker process or we have taken some other approach to handle this scenario?

Further, I haven't got much time to look into the links that you have shared in your previous response. Will have a look into those and will also slowly start looking into the patches as and when I get some time. Thank you.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com

On Sat, Jun 13, 2020 at 9:42 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Jun 12, 2020 at 4:57 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> Hi All,
>
> I've spent little bit of time going through the project discussion that has happened in this email thread and to start with I have few questions which I would like to put here:
>
> Q1) Are we also planning to read the input data in parallel or is it only about performing the multi-insert operation in parallel? AFAIU, the data reading part will be done by the leader process alone so no parallelism is involved there.
>

Yes, your understanding is correct.

> Q2) How are we going to deal with the partitioned tables?
>

I haven't studied the patch but my understanding is that we will
support parallel copy for partitioned tables with a few restrictions
as explained in my earlier email [1]. See, Case-2 (b) in the email.

> I mean will there be some worker process dedicated for each partition or how is it?

No, it the split is just based on the input, otherwise each worker
should insert as we would have done without any workers.

> Q3) Incase of toast tables, there is a possibility of having a single tuple in the input file which could be of a very big size (probably in GB) eventually resulting in a bigger file size. So, in this case, how are we going to decide the number of worker processes to be launched. I mean, although the file size is big, but the number of tuples to be processed is just one or few of them, so, can we decide the number of the worker processes to be launched based on the file size?
>

Yeah, such situations would be tricky, so we should have an option for
user to specify the number of workers.

> Q4) Who is going to process constraints (preferably the deferred constraint) that is supposed to be executed at the COMMIT time? I mean is it the leader process or the worker process or in such cases we won't be choosing the parallelism at all?
>

In the first version, we won't do parallelism for this. Again, see
one of my earlier email [1] where I have explained this and other
cases where we won't be supporting parallel copy.

> Q5) Do we have any risk of table bloating when the data is loaded in parallel. I am just asking this because incase of parallelism there would be multiple processes performing bulk insert into a table. There is a chance that the table file might get extended even if there is some space into the file being written into, but that space is locked by some other worker process and hence that might result in a creation of a new block for that table. Sorry, if I am missing something here.
>

Hmm, each worker will operate at page level, after first insertion,
the same worker will try to insert in the same page in which it has
inserted last, so there shouldn't be such a problem.

> Please note that I haven't gone through all the emails in this thread so there is a possibility that I might have repeated the question that has already been raised and answered here. If that is the case, I am sorry for that, but it would be very helpful if someone could point out that thread so that I can go through it. Thank you.
>

No problem, I understand sometimes it is difficult to go through each
and every email especially when the discussion is long. Anyway,
thanks for showing the interest in the patch.

[1] - https://www.postgresql.org/message-id/CAA4eK1%2BANNEaMJCCXm4naweP5PLY6LhJMvGo_V7-Pnfbh6GsOA%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

В списке pgsql-hackers по дате отправления:

Предыдущее

От: Bruce Momjian
Дата: 15 июня 2020 г., 17:10:32
Сообщение: Re: create database with template doesn't copy database ACL

Следующее

От: Tom Lane
Дата: 15 июня 2020 г., 17:24:22
Сообщение: Re: factorial of negative numbers

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: Parallel copy

Предыдущее

Следующее