Re: where should I stick that backup?

From: David Steele
Subject: Re: where should I stick that backup?
Date:
Msg-id: ab891b06-c9be-c77d-84b4-62fc5a869aad@pgmasters.net
In reply to: Re: where should I stick that backup?  (Andres Freund <andres@anarazel.de>)
Responses: Re: where should I stick that backup?
List: pgsql-hackers
On 4/12/20 6:37 PM, Andres Freund wrote:
> Hi,
> 
> On 2020-04-12 17:57:05 -0400, David Steele wrote:
>> On 4/12/20 3:17 PM, Andres Freund wrote:
>>> [proposal outline]
>>
>> This is pretty much what pgBackRest does. We call them "local" processes and
>> they do most of the work during backup/restore/archive-get/archive-push.
> 
> Hah. I swear, I didn't look.

I believe you. If you spend enough time thinking about this (and we've 
spent a lot) then I think this is where you arrive.

>>> The obvious problem with that proposal is that we don't want to
>>> unnecessarily store the incoming data on the system pg_basebackup is
>>> running on, just for the subcommand to get access to them. More on that
>>> in a second.
>>
>> We also implement "remote" processes so the local processes can get data
>> that doesn't happen to be local, i.e. on a remote PostgreSQL cluster.
> 
> What is the interface between those? I.e. do the files have to be
> spooled as a whole locally?

Currently we use SSH to talk to a remote, but we are planning on using 
our own TLS servers in the future. We don't spool anything -- the file 
is streamed from the PostgreSQL server (via remote protocol if needed) 
to the repo (which could also be remote, e.g. S3) without spooling to 
disk. We have buffers, of course, which are configurable with the 
buffer-size option.
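
Roughly, the flow looks like this (a Python-ish sketch with hypothetical 
source/destination objects, not our actual C code):

    BUFFER_SIZE = 65536  # analogous to the buffer-size option

    def stream_copy(source, destination, buffer_size=BUFFER_SIZE):
        # Read fixed-size chunks from the source (e.g. the remote
        # protocol) and write them straight to the repo storage, so
        # the whole file never touches local disk.
        copied = 0
        while True:
            chunk = source.read(buffer_size)
            if not chunk:
                break
            destination.write(chunk)
            copied += len(chunk)
        return copied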

>>> There's various ways we could address the issue for how the subcommand
>>> can access the file data. The most flexible probably would be to rely on
>>> exchanging file descriptors between basebackup and the subprocess (these
>>> days all supported platforms have that, I think).  Alternatively we
>>> could invoke the subcommand before really starting the backup, and ask
>>> how many files it'd like to receive in parallel, and restart the
>>> subcommand with that number of file descriptors open.
>>
>> We don't exchange FDs. Each local is responsible for getting the data from
>> PostgreSQL or the repo based on knowing the data source and a path. For
>> pg_basebackup, however, I'd imagine each local would want a replication
>> connection with the ability to request specific files that were passed to it
>> by the main process.
> 
> I don't like this much. It'll push more complexity into each of the
> "targets" and we can't easily share that complexity. And also, needing
> to request individual files will add a lot of back/forth, and thus
> latency issues. The server would always have to pre-send a list of
> files, we'd have to deal with those files vanishing, etc.

Sure, unless we had a standard interface to "get a file from the 
PostgreSQL cluster", which is what pgBackRest has via the storage interface.

Attached is our implementation for "backupFile". I think it's pretty 
concise considering what it does. Most of it is dedicated to checksum 
deltas and backup resume. The straight copy with filters starts at line 189.
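
Conceptually (again a hypothetical Python sketch, not the attached C) the 
straight copy is the streaming loop above with a filter chain applied to 
each chunk, e.g. checksum and compression on the way to the repo:

    import hashlib
    import zlib

    def copy_with_filters(source, destination, buffer_size=65536):
        # Hypothetical filter chain: SHA-1 and zlib compression applied
        # per chunk as the file streams from the cluster to the repo.
        sha1 = hashlib.sha1()
        compress = zlib.compressobj()
        copied = 0
        while True:
            chunk = source.read(buffer_size)
            if not chunk:
                break
            sha1.update(chunk)
            destination.write(compress.compress(chunk))
            copied += len(chunk)
        destination.write(compress.flush())
        return copied, sha1.hexdigest()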

>>> [2] yes, I already hear json. A line deliminated format would have some
>>> advantages though.
>>
>> We use JSON, but each protocol request/response is linefeed-delimited. So
>> for example here's what it looks like when the main process requests a local
>> process to backup a specific file:
>>
>> {"cmd":"backupFile","param":["base/32768/33001",true,65536,null,true,0,"pg_data/base/32768/33001",false,0,3,"20200412-213313F",false,null]}
>>
>> And the local responds with:
>>
>> {"out":[1,65536,65536,"6bf316f11d28c28914ea9be92c00de9bea6d9a6b",{"align":true,"error":[0,[3,5],7],"valid":false}]}
> 
> As long as it's line delimited, I don't really care :)

Agreed.
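
To make the framing concrete, here's a minimal sketch of that kind of 
linefeed-delimited exchange (Python, hypothetical stream objects; the 
real protocol carries more state than this):

    import json

    def send_request(stream, cmd, params):
        # One request per line: serialize, append a linefeed, flush.
        stream.write(json.dumps({"cmd": cmd, "param": params}) + "\n")
        stream.flush()

    def read_response(stream):
        # One response per line: read up to the linefeed and parse.
        return json.loads(stream.readline())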

>> We are considering a move to HTTP since lots of services (e.g. S3, GCS,
>> Azure, etc.) require it (so we implement it) and we're not sure it makes
>> sense to maintain our own protocol format. That said, we'd still prefer to
>> use JSON for our payloads (like GCS) rather than XML (as S3 does).
> 
> I'm not quite sure what you mean here? You mean actual requests for each
> of what currently are lines? If so, that sounds *terrible*.

I know it sounds like a lot, but in practice the local (currently) only 
performs four operations: backup file, restore file, push file to 
archive, get file from archive. In that context a little protocol 
overhead won't be noticed, so if it means removing redundant code I'm all 
for it. That said, we have not done this yet -- it's just under 
consideration.
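
If it helps to see how small that command surface is, here's a sketch of 
the dispatch (hypothetical names and placeholder handlers, not our code):

    import json

    # The real handlers stream data as in the sketches above; these
    # placeholders just show the shape of the dispatch.
    HANDLERS = {
        "backupFile":  lambda *p: ("backup", p),
        "restoreFile": lambda *p: ("restore", p),
        "archivePush": lambda *p: ("push", p),
        "archiveGet":  lambda *p: ("get", p),
    }

    def handle_line(line):
        request = json.loads(line)
        return HANDLERS[request["cmd"]](*request["param"])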

Regards,
-- 
-David
david@pgmasters.net

Attachments
