Re: Parallel copy

Поиск

Список

Период

Сортировка

От	Andrew Dunstan
Тема	Re: Parallel copy
Дата	16 февраля 2020 г. 09:51:37
Msg-id	c74e4d42-900c-26a8-df59-13684b154f74@2ndQuadrant.com обсуждение исходный текст
Ответ на	Re: Parallel copy (Amit Kapila <amit.kapila16@gmail.com>)
Ответы	Re: Parallel copy (Amit Kapila <amit.kapila16@gmail.com>) Re: Parallel copy (Mike Blackwell <mike.blackwell@rrd.com>)
Список	pgsql-hackers

Дерево обсуждения

On 2/15/20 7:32 AM, Amit Kapila wrote:
> On Sat, Feb 15, 2020 at 4:08 PM Alastair Turner <minion@decodable.me> wrote:
>> On Sat, 15 Feb 2020 at 04:55, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> On Fri, Feb 14, 2020 at 7:16 PM Alastair Turner <minion@decodable.me> wrote:
>> ...
>>>> Parsing rows from the raw input (the work done by CopyReadLine()) in a single process would accommodate line
returnsin quoted fields. I don't think there's a way of getting parallel workers to manage the in-quote/out-of-quote
staterequired.
 
>>>>
>>> AFAIU, the whole of this in-quote/out-of-quote state is manged inside
>>> CopyReadLineText which will be done by each of the parallel workers,
>>> something on the lines of what Thomas did in his patch [1].
>>> Basically, we need to invent a mechanism to allocate chunks to
>>> individual workers and then the whole processing will be done as we
>>> are doing now except for special handling for partial tuples which I
>>> have explained in my previous email.  Am, I missing something here?
>>>
>> The problem case that I see is the chunk boundary falling in the
>> middle of a quoted field where
>>  - The quote opens in chunk 1
>>  - The quote closes in chunk 2
>>  - There is an EoL character between the start of chunk 2 and the closing quote
>>
>> When the worker processing chunk 2 starts, it believes itself to be in
>> out-of-quote state, so only data between the start of the chunk and
>> the EoL is regarded as belonging to the partial line. From that point
>> on the parsing of the rest of the chunk goes off track.
>>
>> Some of the resulting errors can be avoided by, for instance,
>> requiring a quote to be preceded by a delimiter or EoL. That answer
>> fails when fields end with EoL characters, which happens often enough
>> in the wild.
>>
>> Recovering from an incorrect in-quote/out-of-quote state assumption at
>> the start of parsing a chunk just seems like a hole with no bottom. So
>> it looks to me like it's best done in a single process which can keep
>> track of that state reliably.
>>
> Good point and I agree with you that having a single process would
> avoid any such stuff.   However, I will think some more on it and if
> you/anyone else gets some idea on how to deal with this in a
> multi-worker system (where we can allow each worker to read and
> process the chunk) then feel free to share your thoughts.
>


IIRC, in_quote only matters here in CSV mode (because CSV fields can
have embedded newlines). So why not just forbid parallel copy in CSV
mode, at least for now? I guess it depends on the actual use case. If we
expect to be parallel loading humungous CSVs then that won't fly.


cheers


andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

В списке pgsql-hackers по дате отправления:

Предыдущее

От: Andrew Dunstan
Дата: 16 февраля 2020 г., 09:27:27
Сообщение: Re: jsonb_object() seems to be buggy. jsonb_build_object() is good.

Следующее

От: Pavel Stehule
Дата: 16 февраля 2020 г., 17:12:25
Сообщение: plan cache overhead on plpgsql expression

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: Parallel copy

Предыдущее

Следующее