Re: deduplicating backup of multiple pg_dump dumps

From: Tom Lane
Subject: Re: deduplicating backup of multiple pg_dump dumps
Message-ID: 2666.1517241380@sss.pgh.pa.us
In reply to: Re: deduplicating backup of multiple pg_dump dumps  (Laurenz Albe <laurenz.albe@cybertec.at>)
List: pgsql-admin

Laurenz Albe <laurenz.albe@cybertec.at> writes:
> Egor Duda wrote:
>> I've recently tried to use borg backup (https://borgbackup.readthedocs.io/) to store multiple
>> PostgreSQL database dumps, and encountered a problem. Due to the nondeterministic nature of
>> pg_dump, it reorders table rows on each invocation, which breaks borg backup's chunking and
>> deduplication algorithm.
>>
>> This means that each subsequent dump in the backup almost never reuses data from previous dumps,
>> and so it's not possible to store multiple database dumps as efficiently as possible.
>>
>> I wonder if there's any way to force pg_dump to use some predictable ordering of data rows (for
>> example, by primary key, where possible) to make dumps more uniform, similar to mysqldump's
>> --order-by-primary option?

> There is no such option.

> I think you would be better off with physical backups using "pg_basebackup" if you
> want to deduplicate, at least if deduplication is on the block level.

I think the OP is fooling himself.

pg_dump is perfectly deterministic: dump the same DB twice, you'll get
identical outputs.  The only way that the observed row order would vary
so radically from run to run is if there's a great deal of row update
activity in between, causing rows to get relocated in the heap.  If there
is, and assuming that his application isn't so dumb as to be issuing lots
of no-op updates, then the data is changing a lot.  Therefore there aren't
going to be all that many exact duplicate blocks, no matter whether you
define "block" as a physical data block or a group of rows consecutive
in the PK order.  So this doesn't sound like a case where dedup'ing is
going to be very helpful for compressing backups.  Conceivably sorting
the rows would help at the margin, but I doubt it'd help enough to
justify the cost of the sort.

            regards, tom lane
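
To make the chunk-reuse argument above concrete, here is a rough sketch, not anything pg_dump or borg actually does. It assumes fixed-size 4096-byte chunks (a simplification; real deduplicating tools typically use content-defined chunking) and a hypothetical 10,000-row table rendered as tab-separated text. The names `chunk_hashes`, `reused_fraction`, and `render` are made up for illustration.

```python
import hashlib
import random


def chunk_hashes(data: bytes, chunk_size: int = 4096) -> list:
    """Split data into fixed-size chunks and return the SHA-256 digest of each."""
    return [
        hashlib.sha256(data[i:i + chunk_size]).digest()
        for i in range(0, len(data), chunk_size)
    ]


def reused_fraction(old: bytes, new: bytes, chunk_size: int = 4096) -> float:
    """Fraction of chunks in `new` that already exist somewhere in `old`."""
    known = set(chunk_hashes(old, chunk_size))
    new_chunks = chunk_hashes(new, chunk_size)
    if not new_chunks:
        return 1.0
    return sum(h in known for h in new_chunks) / len(new_chunks)


def render(rows) -> bytes:
    """Render (pk, payload) rows as tab-separated text, one row per line."""
    return "".join(f"{pk}\t{payload}\n" for pk, payload in rows).encode()


# Hypothetical table: 10,000 rows keyed by an integer primary key.
rows = [(pk, f"payload-{pk}") for pk in range(10_000)]

# Same row contents in a different physical order, standing in for
# rows that were relocated in the heap between two dumps.
relocated = rows[:]
random.Random(0).shuffle(relocated)

same_order = reused_fraction(render(rows), render(rows))
shuffled = reused_fraction(render(rows), render(relocated))
sorted_by_pk = reused_fraction(render(rows), render(sorted(relocated)))

print(f"identical order: {same_order:.2f}")
print(f"relocated rows:  {shuffled:.2f}")
print(f"sorted by PK:    {sorted_by_pk:.2f}")
```

This covers only the pure-reordering case, where the row contents are unchanged. If the rows are also being modified heavily between dumps, as argued above, fewer chunks would match even after sorting by primary key.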

