Re: Parallel copy

From: Ants Aasma
Subject: Re: Parallel copy
Date:
Msg-id: CANwKhkOhucXnFomwFS+Sas5=69k21J3JbVJuL-BPXAb4RbsREQ@mail.gmail.com
In reply to: Re: Parallel copy (Kuntal Ghosh <kuntalghosh.2007@gmail.com>)
Responses: Re: Parallel copy
List: pgsql-hackers
On Tue, 14 Apr 2020 at 22:40, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
> 1. Each worker scans a distinct fixed-size chunk of the CSV file and
> collects the following three stats from the chunk:
> a) number of quotes
> b) position of the first new line after even number of quotes
> c) position of the first new line after odd number of quotes
> 2. Once stats from all the chunks are collected, the leader identifies
> the adjusted chunk boundaries by iterating over the stats linearly:
> - For the k-th chunk, the leader sums the number of quotes over the
> preceding k-1 chunks.
> - If the number is even, then the k-th chunk does not start in the
> middle of a quoted field, and the first newline after an even number
> of quotes (the second collected information) is the first record
> delimiter in this chunk.
> - Otherwise, if the number is odd, the first newline after an odd
> number of quotes (the third collected information) is the first record
> delimiter.
> - The end position of the adjusted chunk is obtained based on the
> starting position of the next adjusted chunk.

The trouble is that, at least with the current coding, the number of
quotes in a chunk can depend on whether the chunk started inside a
quote or not. That's because escape characters only count inside
quotes. See for example the following CSV:

foo,\"bar
baz",\"xyz"

This currently parses as a single record, and the number of parsed
quotes doesn't change if you add a quote in front.

But the general approach of doing the tokenization in parallel and
then making a serial pass over the tokenization results would still
work. The quote counting and newline finding just have to be done for
both the starting-in-quote and the not-starting-in-quote case.
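A minimal sketch of that dual-state scan (the names are my own
invention, and I'm assuming a backslash-style escape character
distinct from the quote character, as in the example above):

#include <stdbool.h>
#include <stdint.h>

/* Results under one starting hypothesis: in a quote, or not. */
typedef struct ScanState
{
    bool        in_quote;       /* quote state; ends up as the state
                                 * at the end of the chunk */
    uint64_t    quote_count;    /* quotes seen under this hypothesis */
    int64_t     first_newline;  /* first newline outside quotes, or -1 */
} ScanState;

/*
 * Scan one chunk under both starting hypotheses.  The escape character
 * is only honored inside quotes, which is exactly why the two
 * interpretations can diverge.
 */
static void
scan_chunk(const char *buf, int64_t len, char quote, char escape,
           ScanState *outside, ScanState *inside)
{
    ScanState  *states[2] = {outside, inside};

    outside->in_quote = false;
    inside->in_quote = true;

    for (int s = 0; s < 2; s++)
    {
        ScanState  *st = states[s];

        st->quote_count = 0;
        st->first_newline = -1;

        for (int64_t i = 0; i < len; i++)
        {
            char        c = buf[i];

            if (st->in_quote && c == escape && i + 1 < len)
            {
                i++;            /* skip the escaped character */
                continue;
            }
            if (c == quote)
            {
                st->quote_count++;
                st->in_quote = !st->in_quote;
            }
            else if (c == '\n' && !st->in_quote &&
                     st->first_newline < 0)
                st->first_newline = i;
        }
    }
}

The serial pass then walks the chunks in order: each chunk's actual
starting state is known from the previous chunk's ending in_quote
state, which selects which of the two result sets applies.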

Using phases doesn't look like the correct approach: the tokenization
can be prepared just in time for the serial pass, and processing of
each chunk can proceed immediately after. This could all be done by
keeping the data in a single ring buffer with a processing pipeline:
one process does the reading, workers grab chunks for tokenization as
they become available, a single process determines the chunk
boundaries, and after that the chunks are processed.
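One way to picture that pipeline (purely illustrative; none of these
names exist in any patch):

/* Lifecycle of a chunk slot in the shared ring buffer. */
typedef enum ChunkPhase
{
    CHUNK_EMPTY,        /* slot free, the reader may fill it */
    CHUNK_FILLED,       /* raw bytes read in, awaiting tokenization */
    CHUNK_TOKENIZED,    /* a worker has collected quote/newline stats */
    CHUNK_BOUNDED,      /* serial pass has fixed the record boundaries */
    CHUNK_DONE          /* records inserted, slot can be reused */
} ChunkPhase;

Each stage only waits on the stage before it, so no global phase
barrier is needed: the boundary-fixing process advances slots from
CHUNK_TOKENIZED to CHUNK_BOUNDED strictly in file order, while
everything else can proceed out of order.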

But I still don't think this is something to worry about for the first
version. Just a better line-splitting algorithm should go a looong way
in feeding a large number of workers, even when inserting into an
unindexed, unlogged table. If we get the SIMD line splitting in, it
will be enough to overwhelm most I/O subsystems available today.

Regards,
Ants Aasma


