Re: Parallel copy

From	Heikki Linnakangas
Subject	Re: Parallel copy
Date	23 October 2020, 08:31:09
Msg-id	1ca1b86d-1dcd-2be1-9036-1519166434f6@iki.fi
In reply to	Re: Parallel copy (vignesh C <vignesh21@gmail.com>)
Responses	Re: Parallel copy
List	pgsql-hackers


I had a brief look at this patch. Important work! A couple of first 
impressions:

1. The split between patches 
0002-Framework-for-leader-worker-in-parallel-copy.patch and 
0003-Allow-copy-from-command-to-process-data-from-file.patch is quite 
artificial. All the stuff introduced in the first is unused until the 
second patch is applied. The first patch introduces a forward 
declaration for ParallelCopyData(), but the function only comes in the 
second patch. The comments in the first patch talk about 
LINE_LEADER_POPULATING and LINE_LEADER_POPULATED, but the enum only 
comes in the second patch. I think these have to be merged into one. If you 
want to split it somehow, I'd suggest having a separate patch just to 
move CopyStateData from copy.c to copy.h. The subsequent patch would 
then be easier to read as you could see more easily what's being added 
to CopyStateData. Actually I think it would be better to have a new 
header file, copy_internal.h, to hold CopyStateData and the other 
structs, and keep copy.h as it is.

2. This desperately needs some kind of a high-level overview of how it 
works. What is a leader, what is a worker? Which process does each step 
of COPY processing, like reading from the file/socket, splitting the 
input into lines, handling escapes, calling input functions, and 
updating the heap and indexes? What data structures are used for the 
communication? How is the work synchronized between the processes? 
There are comments on those individual aspects scattered in the patch, 
but if you're not already familiar with it, you don't know where to 
start. There's some of that in the commit message, but it needs to be 
somewhere in the source code, maybe in a long comment at the top of 
copyparallel.c.

3. I'm surprised there's a separate ParallelCopyLineBoundary struct for 
every input line. Doesn't that incur a lot of synchronization overhead? 
I haven't done any testing, this is just my gut feeling, but I assumed 
you'd work in batches of, say, 100 or 1000 lines each.
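
To make the batching idea in point 3 concrete, here is a minimal, 
single-process sketch. It is not taken from the patch: LineQueue, 
claim_lines(), drain() and BATCH_SIZE are hypothetical names, and a plain 
C11 atomic counter stands in for whatever shared-memory state the patch 
actually uses. It only counts how many atomic operations are needed to hand 
out the same number of lines one at a time versus in batches.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical batch size; the email only suggests "100 or 1000". */
#define BATCH_SIZE 100

/* Hypothetical shared queue of lines already populated by the leader. */
typedef struct LineQueue
{
    atomic_uint next_line;      /* index of the next unclaimed line */
    unsigned    total_lines;    /* number of lines available */
} LineQueue;

static void
queue_init(LineQueue *q, unsigned total_lines)
{
    atomic_init(&q->next_line, 0);
    q->total_lines = total_lines;
}

/*
 * Claim up to max_lines lines with a single atomic operation.
 * Returns the number of lines claimed (0 when the queue is exhausted)
 * and stores the index of the first claimed line in *first.
 */
static unsigned
claim_lines(LineQueue *q, unsigned max_lines, unsigned *first)
{
    unsigned    start;

    assert(max_lines > 0);
    start = atomic_fetch_add(&q->next_line, max_lines);
    if (start >= q->total_lines)
        return 0;
    *first = start;
    if (q->total_lines - start < max_lines)
        return q->total_lines - start;  /* short final batch */
    return max_lines;
}

/*
 * Drain the queue as a single worker would, claiming `batch` lines at a
 * time. Returns the number of atomic operations issued, including the
 * final one that finds the queue empty; *lines_seen gets the line count.
 */
static unsigned
drain(LineQueue *q, unsigned batch, unsigned *lines_seen)
{
    unsigned    first;
    unsigned    n;
    unsigned    ops = 0;

    *lines_seen = 0;
    for (;;)
    {
        n = claim_lines(q, batch, &first);
        ops++;
        if (n == 0)
            break;
        *lines_seen += n;       /* lines first .. first + n - 1 */
    }
    return ops;
}
```

With 1000 lines, drain() with a batch of 1 issues 1001 atomic operations, 
while a batch of BATCH_SIZE issues 11. Whether that difference matters in 
practice would still need the testing mentioned above.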

- Heikki
