Re: backup manifests

Поиск
Список
Период
Сортировка
От David Steele
Тема Re: backup manifests
Дата
Msg-id b2b696b4-b11f-e954-c86a-8252a62a2e40@pgmasters.net
обсуждение исходный текст
Ответ на Re: backup manifests  (Robert Haas <robertmhaas@gmail.com>)
Ответы Re: backup manifests  (Robert Haas <robertmhaas@gmail.com>)
Re: backup manifests  (Robert Haas <robertmhaas@gmail.com>)
Список pgsql-hackers
Hi Robert,

On 9/19/19 9:51 AM, Robert Haas wrote:
> On Wed, Sep 18, 2019 at 9:11 PM David Steele <david@pgmasters.net> wrote:
>> Also consider adding the timestamp.
> 
> Sounds reasonable, even if only for the benefit of humans who might
> look at the file.  We can decide later whether to use it for anything
> else (and third-party tools could make different decisions from core).
> I assume we're talking about file mtime here, not file ctime or file
> atime or the time the manifest was generated, but let me know if I'm
> wrong.

In my experience only mtime is useful.

>> Based on my original calculations (which sadly I don't have anymore),
>> the combination of SHA1, size, and file name is *extremely* unlikely to
>> generate a collision.  As in, unlikely to happen before the end of the
>> universe kind of unlikely.  Though, I guess it depends on your
>> expectations for the lifetime of the universe.

> What I'd say is: if
> the probability of getting a collision is demonstrably many orders of
> magnitude less than the probability of the disk writing the block
> incorrectly, then I think we're probably reasonably OK. Somebody might
> differ, which is perhaps a mild point in favor of LSN-based
> approaches, but as a practical matter, if a bad block is a billion
> times more likely to be the result of a disk error than a checksum
> mismatch, then it's a negligible risk.

Agreed.

>> We include the version/sysid of the cluster to avoid mixups.  It's a
>> great extra check on top of references to be sure everything is kosher.
> 
> I don't think it's a good idea to duplicate the information that's
> already in the backup_label. Storing two copies of the same
> information is just an invitation to having to worry about what
> happens if they don't agree.

OK, but now we have backup_label, tablespace_map, 
XXXXXXXXXXXXXXXXXXXXXXXX.XXXXXXXX.backup (in the WAL) and now perhaps a 
backup.manifest file.  I feel like we may be drowning in backup info files.

>> I'd
>> recommend JSON for the format since it is so ubiquitous and easily
>> handles escaping which can be gotchas in a home-grown format.  We
>> currently have a format that is a combination of Windows INI and JSON
>> (for human-readability in theory) and we have become painfully aware of
>> escaping issues.  Really, why would you drop files with '=' in their
>> name in PGDATA?  And yet it happens.
> 
> I am not crazy about JSON because it requires that I get a json parser
> into src/common, which I could do, but given the possibly-imminent end
> of the universe, I'm not sure it's the greatest use of time. You're
> right that if we pick an ad-hoc format, we've got to worry about
> escaping, which isn't lovely.

My experience is that JSON is simple to implement and has already dealt 
with escaping and data structure considerations.  A home-grown solution 
will be at least as complex but have the disadvantage of being non-standard.

>>> One thing I'm not quite sure about is where to store the backup
>>> manifest. If you take a base backup in tar format, you get base.tar,
>>> pg_wal.tar (unless -Xnone), and an additional tar file per tablespace.
>>> Does the backup manifest go into base.tar? Get written into a separate
>>> file outside of any tar archive? Something else? And what about a
>>> plain-format backup? I suppose then we should just write the manifest
>>> into the top level of the main data directory, but perhaps someone has
>>> another idea.
>>
>> We do:
>>
>> [backup_label]/
>>      backup.manifest
>>      pg_data/
>>      pg_tblspc/
>>
>> In general, having the manifest easily accessible is ideal.
> 
> That's a fine choice for a tool, but a I'm talking about something
> that is part of the actual backup format supported by PostgreSQL, not
> what a tool might wrap around it. The choice is whether, for a
> tar-format backup, the manifest goes inside a tar file or as a
> separate file. To put that another way, a patch adding backup
> manifests does not get to redesign where pg_basebackup puts anything
> else; it only gets to decide where to put the manifest.

Fair enough.  The point is to make the manifest easily accessible.

I'd keep it in the data directory for file-based backups and as a 
separate file for tar-based backups.  The advantage here is that we can 
pick a file name that becomes reserved which a tool can't do.

Regards,
-- 
-David
david@pgmasters.net



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Amit Kapila
Дата:
Сообщение: Re: pgbench - allow to create partitioned tables
Следующее
От: Thomas Munro
Дата:
Сообщение: Re: dropdb --force