Re: refactoring basebackup.c

Поиск
Список
Период
Сортировка
От Robert Haas
Тема Re: refactoring basebackup.c
Дата
Msg-id CA+TgmoZ8EAxOw_=2tTMFp+VPLiKXbmTOdsB=0=dUTdZy3r_6jg@mail.gmail.com
обсуждение исходный текст
Ответ на Re: refactoring basebackup.c  (Andres Freund <andres@anarazel.de>)
Ответы Re: refactoring basebackup.c  (Sumanta Mukherjee <sumanta.mukherjee@enterprisedb.com>)
Список pgsql-hackers
On Fri, May 8, 2020 at 5:27 PM Andres Freund <andres@anarazel.de> wrote:
> I wonder if there's cases where recursively forwarding like this will
> cause noticable performance effects. The only operation that seems
> frequent enough to potentially be noticable would be "chunks" of the
> file. So perhaps it'd be good to make sure we read in large enough
> chunks?

Yeah, that needs to be tested. Right now the chunk size is 32kB but it
might be a good idea to go larger. Another thing is that right now the
chunk size is tied to the protocol message size, and I'm not sure
whether the size that's optimal for disk reads is also optimal for
protocol messages.

> This, to me, indicates that we should measure the progress solely based
> on how much of the "source" data was processed. The overhead of tar, the
> reduction due to compression, shouldn't be included.

I don't think it's a particularly bad thing that we include a small
amount of progress for sending an empty file, a directory, or a
symlink. That could make the results more meaningful if you have a
database with lots of empty relations in it. However, I agree that the
effect of compression shouldn't be included. To get there, I think we
need to redesign the wire protocol. Right now, the server has no way
of letting the client know how many uncompressed bytes it's sent, and
the client has no way of figuring it out without uncompressing, which
seems like something we want to avoid.

There are some other problems with the current wire protocol, too:

1. The syntax for the BASE_BACKUP command is large and unwieldy. We
really ought to adopt an extensible options syntax, like COPY, VACUUM,
EXPLAIN, etc. do, rather than using a zillion ad-hoc bolt-ons, each
with bespoke lexer and parser support.

2. The client is sent a list of tablespaces and is supposed to use
that to expect an equal number of archives, computing the name for
each one on the client side from the tablespace info. However, I think
we should be able to support modes like "put all the tablespaces in a
single archive" or "send a separate archive for every 256GB" or "ship
it all to the cloud and don't send me any archives". To get there, I
think we should have the server send the archive name to the clients,
and the client should just keep receiving the next archive until it's
told that there are no more. Then if there's one archive or ten
archives or no archives, the client doesn't have to care. It just
receives what the server sends until it hears that there are no more.
It also doesn't know how the server is naming the archives; the server
can, for example, adjust the archive names based on which compression
format is being chosen, without knowledge of the server's naming
conventions needing to exist on the client side.

I think we should keep support for the current BASE_BACKUP command but
either add a new variant using an extensible options, or else invent a
whole new command with a different name (BACKUP, SEND_BACKUP,
whatever) that takes extensible options. This command should send back
all the archives and the backup manifest using a single COPY stream
rather than multiple COPY streams. Within the COPY stream, we'll
invent a sub-protocol, e.g. based on the first letter of the message,
e.g.:

t = Tablespace boundary. No further message payload. Indicates, for
progress reporting purposes, that we are advancing to the next
tablespace.
f = Filename. The remainder of the message payload is the name of the
next file that will be transferred.
d = Data. The next four bytes contain the number of uncompressed bytes
covered by this message, for progress reporting purposes. The rest of
the message is payload, possibly compressed. Could be empty, if the
data is being shipped elsewhere, and these messages are only being
sent to update the client's notion of progress.

> I've not though enough about the specifics, but I think it looks like
> it's going roughly in a better direction.

Good to hear.

> One thing I wonder about is how stateful the interface is. Archivers
> will pretty much always track which file is currently open etc. Somehow
> such a repeating state machine seems a bit ugly - but I don't really
> have a better answer.

I thought about that a bit, too. There might be some way to unify that
by having some common context object that's defined by basebackup.c
and all archivers get it, so that they have some commonly-desired
details without needing bespoke code, but I'm not sure at this point
whether that will actually produce a nicer result. Even if we don't
have it initially, it seems like it wouldn't be very hard to add it
later, so I'm not too stressed about it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Tom Lane
Дата:
Сообщение: Re: making update/delete of inheritance trees scale better
Следующее
От: Robert Haas
Дата:
Сообщение: Re: making update/delete of inheritance trees scale better