Re: Parallel copy

From: Amit Kapila
Subject: Re: Parallel copy
Date:
Msg-id: CAA4eK1+AjvU-+tzs5Ng2q94b6cw49gZsTPQMisjJ5iPaVEV8yQ@mail.gmail.com
In reply to: Re: Parallel copy  (Kuntal Ghosh <kuntalghosh.2007@gmail.com>)
Responses: Re: Parallel copy
List: pgsql-hackers
On Wed, Apr 15, 2020 at 1:10 AM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
>
> Hence, I was trying to think whether we can leverage this idea for
> implementing parallel COPY in PG. We can design an algorithm similar
> to parallel hash-join where the workers pass through different phases.
> 1. Phase 1 - Read fixed-size chunks in parallel, storing the chunks
> and small stats about each chunk in shared memory. If the shared
> memory is full, go to phase 2.
> 2. Phase 2 - Allow a single worker to process the stats and decide the
> actual chunk boundaries so that no tuple spans across two different
> chunks. Go to phase 3.
>
> 3. Phase 3 - Each worker picks one adjusted chunk and parses and
> processes the tuples in it. Once done with one chunk, it picks the
> next one, and so on.
>
> 4. If there are still some unread contents, go back to phase 1.
>
> We can probably use separate workers for phase 1 and phase 3 so that
> they can work concurrently.
>
> Advantages:
> 1. Each worker spends significant time in each phase and gets the
> benefit of the instruction cache, at least in phase 1.
> 2. It also has the same advantage of parallel hash join - fast workers
> get to work more.
> 3. We can extend this solution for reading data from STDIN. Of course,
> phases 1 and 2 must then be performed by the leader process, which
> can read from the socket.
>
> Disadvantages:
> 1. Surely doesn't work if we don't have enough shared memory.
> 2. Probably, this approach is just impractical for PG due to certain
> limitations.
>
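
For concreteness, here is a minimal sketch of the phase-2 boundary
adjustment described in the quoted design. It is hypothetical (not
taken from any patch): it assumes the phase-1 readers fill one
contiguous buffer, that rows are newline-terminated, and that there
are no quoted newlines embedded in fields, which real COPY would have
to handle.

#include <stddef.h>
#include <string.h>

/*
 * Move each provisional fixed-size chunk boundary forward to just
 * past the next newline so that no tuple spans two chunks.
 * "bounds" holds the provisional end offsets of the nbounds chunks.
 */
static void
adjust_chunk_boundaries(const char *buf, size_t buflen,
                        size_t *bounds, int nbounds)
{
    size_t      start = 0;
    int         i;

    for (i = 0; i < nbounds; i++)
    {
        /* Never move a boundary behind the previous adjusted one. */
        size_t      pos = (bounds[i] > start) ? bounds[i] : start;
        const char *nl;

        if (pos >= buflen)
            bounds[i] = buflen;
        else
        {
            nl = memchr(buf + pos, '\n', buflen - pos);
            bounds[i] = nl ? (size_t) (nl - buf + 1) : buflen;
        }
        start = bounds[i];
    }
}

A phase-3 worker would then treat chunk i as the byte range
[bounds[i-1], bounds[i]) and parse only complete lines; a line longer
than one provisional chunk simply leaves some adjusted chunks empty.
(A refinement would leave a boundary alone when it already falls
immediately after a newline.)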

As I understand this, it needs to parse the lines twice (the second
time in phase 3), and until the first two phases are over, we can't
start the tuple-processing work that is done in phase 3.  So even if
the tokenization is done a bit faster, we will lose some time on
processing the tuples, which might not be an overall win; in fact, it
can be worse than the single-reader approach being discussed.  Now, if
the work done in tokenization were a major (or significant) portion of
the copy, such a technique might be worth considering, but that is not
the case, as seen in the data shared earlier in this thread (the
tokenize time is very small compared to the data-processing time).
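
As a rough illustration with made-up numbers (not measurements from
this thread): if tokenization were, say, 10% of the total copy time
and tuple processing the other 90%, then speeding up the tokenization
alone could shave at most that 10%, while the design above still pays
for parsing every line twice and for the stall before phase 3 can
begin.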

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


