Re: Parallel copy

From: Ants Aasma
Subject: Re: Parallel copy
Date:
Msg-id: CANwKhkOhucXnFomwFS+Sas5=69k21J3JbVJuL-BPXAb4RbsREQ@mail.gmail.com
In reply to: Re: Parallel copy (Kuntal Ghosh <kuntalghosh.2007@gmail.com>)
Responses: Re: Parallel copy
List: pgsql-hackers
On Tue, 14 Apr 2020 at 22:40, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
> 1. Each worker scans a distinct fixed-size chunk of the CSV file and
> collects the following three stats from the chunk:
> a) number of quotes
> b) position of the first new line after even number of quotes
> c) position of the first new line after odd number of quotes
> 2. Once stats from all the chunks are collected, the leader identifies
> the adjusted chunk boundaries by iterating over the stats linearly:
> - For the k-th chunk, the leader sums the number of quotes over the
> preceding k-1 chunks.
> - If the number is even, then the k-th chunk does not start in the
> middle of a quoted field, and the first newline after an even number
> of quotes (the second collected information) is the first record
> delimiter in this chunk.
> - Otherwise, if the number is odd, the first newline after an odd
> number of quotes (the third collected information) is the first record
> delimiter.
> - The end position of the adjusted chunk is obtained based on the
> starting position of the next adjusted chunk.

The trouble is that, at least with the current coding, the number of
quotes in a chunk can depend on whether the chunk started inside a
quote or not. That's because escape characters only count inside
quotes. See for example the following CSV:

foo,\"bar
baz",\"xyz"

This currently parses as a single record, and the number of parsed
quotes doesn't change if you add a quote in front.

But the general approach of doing the tokenization in parallel and
then making a serial pass over the tokenization results would still
work. The quote counting and newline finding just have to be done for
both the starting-in-quote and the not-starting-in-quote case.
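A minimal sketch of that dual-state scan (the names are my own
invention, and I'm assuming a backslash-style escape character
distinct from the quote character, as in the example above):

#include <stdbool.h>
#include <stdint.h>

/* Results under one starting hypothesis: in a quote, or not. */
typedef struct ScanState
{
    bool        in_quote;       /* quote state; ends up as the state
                                 * at the end of the chunk */
    uint64_t    quote_count;    /* quotes seen under this hypothesis */
    int64_t     first_newline;  /* first newline outside quotes, or -1 */
} ScanState;

/*
 * Scan one chunk under both starting hypotheses.  The escape character
 * is only honored inside quotes, which is exactly why the two
 * interpretations can diverge.
 */
static void
scan_chunk(const char *buf, int64_t len, char quote, char escape,
           ScanState *outside, ScanState *inside)
{
    ScanState  *states[2] = {outside, inside};

    outside->in_quote = false;
    inside->in_quote = true;

    for (int s = 0; s < 2; s++)
    {
        ScanState  *st = states[s];

        st->quote_count = 0;
        st->first_newline = -1;

        for (int64_t i = 0; i < len; i++)
        {
            char        c = buf[i];

            if (st->in_quote && c == escape && i + 1 < len)
            {
                i++;            /* skip the escaped character */
                continue;
            }
            if (c == quote)
            {
                st->quote_count++;
                st->in_quote = !st->in_quote;
            }
            else if (c == '\n' && !st->in_quote &&
                     st->first_newline < 0)
                st->first_newline = i;
        }
    }
}

The serial pass then walks the chunks in order: each chunk's actual
starting state is known from the previous chunk's ending in_quote
state, which selects which of the two result sets applies.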

Using phases doesn't look like the correct approach: the tokenization
can be prepared just in time for the serial pass, and processing of
each chunk can proceed immediately after. This could all be done by
keeping the data in a single ring buffer with a processing pipeline:
one process does the reading, workers grab chunks for tokenization as
they become available, a single process determines the chunk
boundaries, and after that the chunks are processed.
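One way to picture that pipeline (purely illustrative; none of these
names exist in any patch):

/* Lifecycle of a chunk slot in the shared ring buffer. */
typedef enum ChunkPhase
{
    CHUNK_EMPTY,        /* slot free, the reader may fill it */
    CHUNK_FILLED,       /* raw bytes read in, awaiting tokenization */
    CHUNK_TOKENIZED,    /* a worker has collected quote/newline stats */
    CHUNK_BOUNDED,      /* serial pass has fixed the record boundaries */
    CHUNK_DONE          /* records inserted, slot can be reused */
} ChunkPhase;

Each stage only waits on the stage before it, so no global phase
barrier is needed: the boundary-fixing process advances slots from
CHUNK_TOKENIZED to CHUNK_BOUNDED strictly in file order, while
everything else can proceed out of order.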

But I still don't think this is something to worry about for the first
version. Just a better line-splitting algorithm should go a looong way
in feeding a large number of workers, even when inserting into an
unindexed, unlogged table. If we get the SIMD line splitting in, it
will be enough to overwhelm most I/O subsystems available today.

Regards,
Ants Aasma


