Re: parallelizing the archiver

From: Robert Haas
Subject: Re: parallelizing the archiver
Date:
Msg-id: CA+TgmoZUd6zBNb+boukVXrGAVgLyU-fPY+6yfiKj6abmNUCvWA@mail.gmail.com
In reply to: Re: parallelizing the archiver  (Julien Rouhaud <rjuju123@gmail.com>)
Responses: Re: parallelizing the archiver  (Julien Rouhaud <rjuju123@gmail.com>)
           Re: parallelizing the archiver  ("Bossart, Nathan" <bossartn@amazon.com>)
List: pgsql-hackers
On Fri, Sep 10, 2021 at 10:19 AM Julien Rouhaud <rjuju123@gmail.com> wrote:
> Those approaches don't really seem mutually exclusive?  In both
> cases you will need to internally track the status of each WAL file
> and handle non-contiguous file sequences.  In the case of parallel
> commands you only need the additional knowledge that some command is
> already working on a file.  Wouldn't it be even better to eventually
> be able to launch multiple batches of multiple files rather than a
> single batch?

Well, I guess I'm not convinced. Perhaps people with more knowledge of
this than I may already know why it's beneficial, but in my experience
commands like 'cp' and 'scp' are usually limited by the speed of I/O,
not the fact that you only have one of them running at once. Running
several at once, again in my experience, is typically not much faster.
On the other hand, scp has a LOT of startup overhead, so it's easy to
see the benefits of batching.

[rhaas pgsql]$ touch x y z
[rhaas pgsql]$ time sh -c 'scp x cthulhu: && scp y cthulhu: && scp z cthulhu:'
x                                             100%  207KB  78.8KB/s   00:02
y                                             100%    0     0.0KB/s   00:00
z                                             100%    0     0.0KB/s   00:00

real 0m9.418s
user 0m0.045s
sys 0m0.071s
[rhaas pgsql]$ time sh -c 'scp x y z cthulhu:'
x                                             100%  207KB 273.1KB/s   00:00
y                                             100%    0     0.0KB/s   00:00
z                                             100%    0     0.0KB/s   00:00

real 0m3.216s
user 0m0.017s
sys 0m0.020s
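
Just to make that concrete, here's a rough sketch of what a batched
archive command could look like, assuming a purely hypothetical
interface where the server hands the command several ready WAL file
names at once (no such interface exists today, and the archive host
and target directory here are made up):

#!/bin/sh
# Hypothetical batched archive command: "$@" holds the names of
# several ready WAL segments.  A single scp invocation amortizes the
# connection-setup overhead across the whole batch.
cd pg_wal || exit 1     # archive commands run in the data directory
exec scp "$@" cthulhu:/archive/wal/

Of course, if scp dies partway through, the server gets back a single
exit code for the whole batch, which is exactly the reporting problem
you raise below.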

> If we start with parallelism first, the whole ecosystem could
> immediately benefit from it as is.  To be able to handle multiple
> files in a single command, we would need some way to let the server
> know which files were successfully archived and which files weren't,
> so it requires a different communication approach than the command
> return code.

That is possibly true. I think it might work to just assume that you
have to retry everything if it exits non-zero, but that requires the
archive command to be smart enough to do something sensible if an
identical file is already present in the archive.
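
Something along these lines, say -- just a sketch, assuming a plain
local archive directory at /archive/wal and a made-up script name,
and ignoring details like fsync'ing the copied file:

#!/bin/sh
# Retry-tolerant archive command, invoked as: archive-one.sh %p %f
# Succeeds if an identical copy is already archived; fails if a
# different file already has that name.
src="$1"                 # %p: path to the WAL segment to archive
dst="/archive/wal/$2"    # %f: the file name alone
if [ -f "$dst" ]; then
    cmp -s "$src" "$dst" && exit 0   # identical copy present: success
    exit 1                           # same name, different contents: error
fi
cp "$src" "$dst"

Then retrying an already-archived file is harmless, and a genuine
name collision still gets reported as a failure.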

> But as I said, I'm not convinced that the archive_command approach
> is the best one for that.  If I understand correctly, most of the
> backup solutions would prefer to have a daemon launched and used as
> a queuing system.  Wouldn't it be better to have a new archive_mode,
> e.g. "daemon", and have postgres responsible for (re)starting it,
> passing information through the daemon's stdin/stdout or something
> like that?

Sure. Actually, I think a background worker would be better than a
separate daemon. Then it could just talk to shared memory directly.
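
Either way, the message flow could be about as simple as this -- a
made-up sketch of the stdin/stdout idea, with an invented OK/FAIL
protocol, whether the loop lives in an external daemon or behind a
background worker:

#!/bin/sh
# Hypothetical archiver daemon: the server writes one ready WAL file
# name per line to our stdin; we answer "OK <name>" or "FAIL <name>"
# on stdout so it can track per-file status and retry failures.
while IFS= read -r f; do
    if scp "pg_wal/$f" "cthulhu:/archive/wal/$f"; then
        echo "OK $f"
    else
        echo "FAIL $f"
    fi
done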

-- 
Robert Haas
EDB: http://www.enterprisedb.com


