Обсуждение: How to get a more RSYNC compatible output of pg_dump?
Hi everyone, for various historical reasons I maintain a database containing large file uploads, which makes uncompressed output of pg_dump ~200 GiB in size currently. I'm storing that dump to some NAS and am trying to forward it from there using RSYNC to multiple different additional offsite USB disks. I'm doing the same with the files directory of Postgres already after taking BTRFS snapshots etc. and for those files things work pretty well with RSYNC. Lots of files are skipped entirely, some are slightly updated in-place, some updates are a bit larger depending on the actual changes and when RSYNC executed last etc. Though, with the large dumps it seems to me that with every slight change in the actual data the entire dump gets downloaded again. I'm already using uncompressed dumps in the hope that the output is more stable and RSYNC better able to recognize unchanged parts. But I guess that most changes in the dumped data simply result in all subsequent data being that misplaced compared to what RSYNC reads against, that it's like downloading the whole file again in the end. Is that simply the way it is or are there some optimizations possible when using pg_dump? Am using Postgres 11 and don't see anything which seems to help in this use-case. Thanks! Mit freundlichen Grüßen Thorsten Schöning -- AM-SoFT IT-Service - Bitstore Hameln GmbH Mitglied der Bitstore Gruppe - Ihr Full-Service-Dienstleister für IT und TK E-Mail: Thorsten.Schoening@AM-SoFT.de Web: http://www.AM-SoFT.de/ Tel: 05151- 9468- 0 Tel: 05151- 9468-55 Fax: 05151- 9468-88 Mobil: 0178-8 9468-04 AM-SoFT IT-Service - Bitstore Hameln GmbH, Brandenburger Str. 7c, 31789 Hameln AG Hannover HRB 221853 - Geschäftsführer: Janine Galonska Für Rückfragen stehe ich Ihnen jederzeit zur Verfügung. Mit freundlichen Grüßen, Thorsten Schöning Telefon: +49 (0)515 94 68 - 0 Fax: E-Mail: TSchoening@am-soft.de AM-Soft IT-Service - Bitstore Hameln GmbH Brandenburger Straße 7c 31789 Hameln Diese E-Mail enthält vertrauliche und/oder rechtlich geschützte Informationen und ist ausschliesslich für den Adressatenbestimmt. Jeglicher Zugriff auf diese E-Mail durch andere Personen als den Adressaten ist untersagt. Wenn Sie nichtder richtige Adressat sind oder diese E-Mail irrtümlich erhalten haben, informieren Sie bitte sofort den Absender undvernichten Sie diese E-Mail. Sollten Sie nicht der für diese E-Mail bestimmte Adressat sein, ist Ihnen jede Veröffentlichung,Vervielfältigung oder Weitergabe wie auch das Ergreifen oder Unterlassen von Massnahmen im Vertrauen auferlangte Information untersagt. This e-mail may contain confidential and/or privileged information and is intended solely for the addressee. Access to thisemail by anyone else is unauthorized. If you are not the intended recipient (or have received this e-mail in error) pleasenotify the sender immediately and destroy this e-mail. If you are not the intended recipient, any disclosure, copying,distribution or any action taken or omitted to be taken in reliance on it, is prohibited and may be unlawful. Hinweise zum Datenschutz: bitstore.group/datenschutz
Am 16.05.22 um 09:56 schrieb Thorsten Schöning: > Hi everyone, > > for various historical reasons I maintain a database containing large > file uploads, which makes uncompressed output of pg_dump ~200 GiB in > size currently. I'm storing that dump to some NAS and am trying to > forward it from there using RSYNC to multiple different additional > offsite USB disks. > > I'm doing the same with the files directory of Postgres already after > taking BTRFS snapshots etc. and for those files things work pretty > well with RSYNC. Lots of files are skipped entirely, some are slightly > updated in-place, some updates are a bit larger depending on the > actual changes and when RSYNC executed last etc. > > Though, with the large dumps it seems to me that with every slight > change in the actual data the entire dump gets downloaded again. I'm > already using uncompressed dumps in the hope that the output is more > stable and RSYNC better able to recognize unchanged parts. But I guess > that most changes in the dumped data simply result in all subsequent > data being that misplaced compared to what RSYNC reads against, that > it's like downloading the whole file again in the end. > > Is that simply the way it is or are there some optimizations possible > when using pg_dump? Am using Postgres 11 and don't see anything which > seems to help in this use-case. > > Thanks! > > Mit freundlichen Grüßen > > Thorsten Schöning > Hi Thorsten, This is an rsync question, not a pg_dump question. If you want to sync a new version of a file without transferring the whole thing, you have to use the option -c or --checksum. This works well only if some blocks of the file have changed, while most others haven't. This won't be the case of a pg_dump. So I don't see a way of re-syncing the way you expect it to. Regards, Holger -- Holger Jakobs, Bergisch Gladbach, Tel. +49-178-9759012
Вложения
On Mon, May 16, 2022 at 09:56:34AM +0200, Thorsten Schöning wrote: > Is that simply the way it is or are there some optimizations possible > when using pg_dump? Am using Postgres 11 and don't see anything which > seems to help in this use-case. What I'd suggest: 1. Use -Fd, to have each table separately 2. Use -Z0 to avoid normal compression 3. Use `gzip --rsyncable` on the data, to make the backup smaller for transmit (though you will need to unpack it to restore), or play with other compressors, like xz depesz
Guten Tag Holger Jakobs, am Montag, 16. Mai 2022 um 12:52 schrieben Sie: > If you want to sync a new version of a file without transferring > the whole thing, you have to use the option -c or --checksum. No, --checksum is only for (slowly) recognizing changes at all, transfer is differential by default with or without that option. > -c, --checksum skip based on checksum, not mod-time & size https://linux.die.net/man/1/rsync > This works well only if some blocks of the file have changed, while > most others haven't. This won't be the case of a pg_dump. It is the case for some dumps I'm creating, but those are a lot smaller and don't store large objects like files. Therefore I was wondering about the order in which pg_dump outputs tables, rows etc., e.g. if that is the same always or changing with each exec for some reason. Mit freundlichen Grüßen Thorsten Schöning -- AM-SoFT IT-Service - Bitstore Hameln GmbH Mitglied der Bitstore Gruppe - Ihr Full-Service-Dienstleister für IT und TK E-Mail: Thorsten.Schoening@AM-SoFT.de Web: http://www.AM-SoFT.de/ Tel: 05151- 9468- 0 Tel: 05151- 9468-55 Fax: 05151- 9468-88 Mobil: 0178-8 9468-04 AM-SoFT IT-Service - Bitstore Hameln GmbH, Brandenburger Str. 7c, 31789 Hameln AG Hannover HRB 221853 - Geschäftsführer: Janine Galonska Für Rückfragen stehe ich Ihnen jederzeit zur Verfügung. Mit freundlichen Grüßen, Thorsten Schöning Telefon: +49 (0)515 94 68 - 0 Fax: E-Mail: TSchoening@am-soft.de AM-Soft IT-Service - Bitstore Hameln GmbH Brandenburger Straße 7c 31789 Hameln Diese E-Mail enthält vertrauliche und/oder rechtlich geschützte Informationen und ist ausschliesslich für den Adressatenbestimmt. Jeglicher Zugriff auf diese E-Mail durch andere Personen als den Adressaten ist untersagt. Wenn Sie nichtder richtige Adressat sind oder diese E-Mail irrtümlich erhalten haben, informieren Sie bitte sofort den Absender undvernichten Sie diese E-Mail. Sollten Sie nicht der für diese E-Mail bestimmte Adressat sein, ist Ihnen jede Veröffentlichung,Vervielfältigung oder Weitergabe wie auch das Ergreifen oder Unterlassen von Massnahmen im Vertrauen auferlangte Information untersagt. This e-mail may contain confidential and/or privileged information and is intended solely for the addressee. Access to thisemail by anyone else is unauthorized. If you are not the intended recipient (or have received this e-mail in error) pleasenotify the sender immediately and destroy this e-mail. If you are not the intended recipient, any disclosure, copying,distribution or any action taken or omitted to be taken in reliance on it, is prohibited and may be unlawful. Hinweise zum Datenschutz: bitstore.group/datenschutz
Am 16.05.22 um 14:50 schrieb Thorsten Schöning: > Guten Tag Holger Jakobs, > am Montag, 16. Mai 2022 um 12:52 schrieben Sie: > >> If you want to sync a new version of a file without transferring >> the whole thing, you have to use the option -c or --checksum. > No, --checksum is only for (slowly) recognizing changes at all, > transfer is differential by default with or without that option. Yes, the option causes rsync to also detect changes if the size and timestamps haven't changed. >> -c, --checksum skip based on checksum, not mod-time & size > https://linux.die.net/man/1/rsync > >> This works well only if some blocks of the file have changed, while >> most others haven't. This won't be the case of a pg_dump. > It is the case for some dumps I'm creating, but those are a lot > smaller and don't store large objects like files. Therefore I was > wondering about the order in which pg_dump outputs tables, rows etc., > e.g. if that is the same always or changing with each exec for some > reason. > If only one byte close to the beginning of the file is added, this would shift all following bytes by 1, so no block of the file will be identical and have to be re-transferred. Therfore, the only option is creating separate files for each table. This would transfer only those files where the content of the table has been changed since the last transmission. There has been another answer by depesz suggesting that. Regards, Holger -- Holger Jakobs, Bergisch Gladbach, Tel. +49-178-9759012
Вложения
Guten Tag hubert depesz lubaczewski, am Montag, 16. Mai 2022 um 14:28 schrieben Sie: > 1. Use -Fd, to have each table separately Good hint, but doesn't work (easily) in my context. I'm streaming dumps through SSH to store as plain file or in some contexts even into a named pipe backed up by BorgBackup/BorgMatic. Though, with BorgBackup de-duplication seems to work pretty well, so I might focus on RSYNC. Streaming through SSH using TAR-format and UNTAR locally afterwards with some additional DIR-handling might be an option... Need to think about this, thanks! Mit freundlichen Grüßen Thorsten Schöning -- AM-SoFT IT-Service - Bitstore Hameln GmbH Mitglied der Bitstore Gruppe - Ihr Full-Service-Dienstleister für IT und TK E-Mail: Thorsten.Schoening@AM-SoFT.de Web: http://www.AM-SoFT.de/ Tel: 05151- 9468- 0 Tel: 05151- 9468-55 Fax: 05151- 9468-88 Mobil: 0178-8 9468-04 AM-SoFT IT-Service - Bitstore Hameln GmbH, Brandenburger Str. 7c, 31789 Hameln AG Hannover HRB 221853 - Geschäftsführer: Janine Galonska Für Rückfragen stehe ich Ihnen jederzeit zur Verfügung. Mit freundlichen Grüßen, Thorsten Schöning Telefon: +49 (0)515 94 68 - 0 Fax: E-Mail: TSchoening@am-soft.de AM-Soft IT-Service - Bitstore Hameln GmbH Brandenburger Straße 7c 31789 Hameln Diese E-Mail enthält vertrauliche und/oder rechtlich geschützte Informationen und ist ausschliesslich für den Adressatenbestimmt. Jeglicher Zugriff auf diese E-Mail durch andere Personen als den Adressaten ist untersagt. Wenn Sie nichtder richtige Adressat sind oder diese E-Mail irrtümlich erhalten haben, informieren Sie bitte sofort den Absender undvernichten Sie diese E-Mail. Sollten Sie nicht der für diese E-Mail bestimmte Adressat sein, ist Ihnen jede Veröffentlichung,Vervielfältigung oder Weitergabe wie auch das Ergreifen oder Unterlassen von Massnahmen im Vertrauen auferlangte Information untersagt. This e-mail may contain confidential and/or privileged information and is intended solely for the addressee. Access to thisemail by anyone else is unauthorized. If you are not the intended recipient (or have received this e-mail in error) pleasenotify the sender immediately and destroy this e-mail. If you are not the intended recipient, any disclosure, copying,distribution or any action taken or omitted to be taken in reliance on it, is prohibited and may be unlawful. Hinweise zum Datenschutz: bitstore.group/datenschutz
Guten Tag Thorsten Schöning, am Montag, 16. Mai 2022 um 18:48 schrieben Sie: > Streaming through SSH using TAR-format and UNTAR locally afterwards > with some additional DIR-handling might be an option... Need to think > about this, thanks! Things seem to be pretty easy in the end and by piping one is able to avoid additional local copies as well: > mkdir -p [...] > pushd [...] > ssh [...] | tar --extract --no-seek --overwrite > popd > pg_restore --dbname=test --format=d DB_DIR The only downside of this approach seems to be lots of files in one directory and am not sure currently if extraneous files are correctly deleted. But better having too many than too few files... ;-) Mit freundlichen Grüßen Thorsten Schöning -- AM-SoFT IT-Service - Bitstore Hameln GmbH Mitglied der Bitstore Gruppe - Ihr Full-Service-Dienstleister für IT und TK E-Mail: Thorsten.Schoening@AM-SoFT.de Web: http://www.AM-SoFT.de/ Tel: 05151- 9468- 0 Tel: 05151- 9468-55 Fax: 05151- 9468-88 Mobil: 0178-8 9468-04 AM-SoFT IT-Service - Bitstore Hameln GmbH, Brandenburger Str. 7c, 31789 Hameln AG Hannover HRB 221853 - Geschäftsführer: Janine Galonska Für Rückfragen stehe ich Ihnen jederzeit zur Verfügung. Mit freundlichen Grüßen, Thorsten Schöning Telefon: +49 (0)515 94 68 - 0 Fax: E-Mail: TSchoening@am-soft.de AM-Soft IT-Service - Bitstore Hameln GmbH Brandenburger Straße 7c 31789 Hameln Diese E-Mail enthält vertrauliche und/oder rechtlich geschützte Informationen und ist ausschliesslich für den Adressatenbestimmt. Jeglicher Zugriff auf diese E-Mail durch andere Personen als den Adressaten ist untersagt. Wenn Sie nichtder richtige Adressat sind oder diese E-Mail irrtümlich erhalten haben, informieren Sie bitte sofort den Absender undvernichten Sie diese E-Mail. Sollten Sie nicht der für diese E-Mail bestimmte Adressat sein, ist Ihnen jede Veröffentlichung,Vervielfältigung oder Weitergabe wie auch das Ergreifen oder Unterlassen von Massnahmen im Vertrauen auferlangte Information untersagt. This e-mail may contain confidential and/or privileged information and is intended solely for the addressee. Access to thisemail by anyone else is unauthorized. If you are not the intended recipient (or have received this e-mail in error) pleasenotify the sender immediately and destroy this e-mail. If you are not the intended recipient, any disclosure, copying,distribution or any action taken or omitted to be taken in reliance on it, is prohibited and may be unlawful. Hinweise zum Datenschutz: bitstore.group/datenschutz
Though, with the large dumps it seems to me that with every slight
change in the actual data the entire dump gets downloaded again. I'm
already using uncompressed dumps in the hope that the output is more
stable and RSYNC better able to recognize unchanged parts. But I guess
that most changes in the dumped data simply result in all subsequent
data being that misplaced compared to what RSYNC reads against, that
it's like downloading the whole file again in the end.
For me the rsync differential algorithm on dump files just works with no special preparations.  I think you need to investigate this from the rsync side, not PostgreSQL.  For example, if you change a dump file just by adding or deleting one character near the beginning, how does rsync respond to that? 
Cheers,
Jeff