Discussion: Reliable WAL file shipping over unreliable network
Hello!

Let's suppose that a replication master is writing a WAL segment file into a directory, and that directory is mounted on the replication slave. Is it possible that the slave will try to read a WAL segment file that is not yet fully flushed to disk? I did not find any requirement about this in the official documentation. How does it work? Do I have to copy segments to temp files and rename them when they are fully flushed to disk? Or is it okay to have half-complete files in the archive dir for a while?

Actually, I cannot mount the slave's archive directory on the master, because the network is not reliable. The WAL files need to be copied from the master to the slave. The network can go down in the middle of a copy operation. I can write a program that detects this and retries the copy when the network comes back. But what happens on the slave side? What will a replication slave do if it sees a half-complete WAL file?

Is there a utility that is widely used for WAL file shipping and addresses these problems? I know that PostgreSQL itself does not care about how the log files are shipped, but I suspect that there are robust, proven methods for shipping WAL files over unreliable networks.

And finally: if I also enable streaming replication, then it seems that log file shipping is not needed at all. If I omit archive_command and restore_command from the configs and set up only the replication slots and primary_conninfo, then it seems to be working just fine. But when the network goes down for a while, the slave goes out of sync and cannot recover. It was not clear to me from the documentation, but am I right that I can combine log file shipping with streaming replication, and achieve small replication delays plus the ability to recover after a longer period of network outage?

Thanks,

Laszlo
Have you tried pglogical? It should work fine for cases like yours, and I know first hand that it works well on unstable connections: I have an 18-slave pglogical cluster over bad connections, and it works pretty well.
Regards,
Alvaro Aguayo
Jefe de Operaciones
Open Comb Systems E.I.R.L.
Oficina: (+51-1) 3377813 | RPM: #034252 / (+51) 995540103 | RPC: (+51) 954183248
Website: www.ocs.pe
Sent from my Sony Xperia™ smartphone
---- Nagy László Zsolt wrote ----
Hello!
Let's suppose that a replication master is writing a WAL segment file
into a directory. That directory is mounted on the replication slave. Is
it possible that the slave will try to read a WAL segment file that is
not yet fully flushed to disk? I did not find any requirement in the
official documentation about this. How does it work? Do I have to copy
segments to temp files, and rename them when they are fully flushed to
disk? Or is it okay to have half complete files in the archive dir for a
while?
Actually I cannot mount the slave's archive directory on the master,
because the network is not reliable. The WAL files need to be copied
from the master to the slave. The network can go down in the middle of a
copy operation. I can write a program that detects this, and retries the
copy when the network comes back. But what happens on the slave side?
What will a replication slave do if it sees a half complete WAL file?
Is there a utility that is widely used for WAL file shipping, and
addresses these problems? I know that PostgreSQL itself does not care
about how the log files are shipped, but I suspect that there are
robust, proven methods for shipping WAL files over unreliable networks.
And finally: if I also enable streaming replication, then it seems that
log file shipping is not needed at all. If I omit archive_command and
restore_command from the configs, and setup the replication slots and
primary_conninfo only, then it seems to be working just fine. But when
the network goes down for a while, then the slave goes out of sync and
it cannot recover. It was not clear for me from the documentation, but
am I right in that I can combine log file shipping with streaming
replication, and achieve small replication delays plus the ability to
recover after a longer period of network outage?
Thanks,
Laszlo
On Feb 28, 2018, at 7:57 AM, Nagy László Zsolt <gandalf@shopzeus.com> wrote:
>
> Is there a utility that is widely used for WAL file shipping, and
> addresses these problems?

rsync

--
Scott Ribe
https://www.linkedin.com/in/scottribe/
(303) 722-0567
On 2/28/18 9:57 AM, Nagy László Zsolt wrote:
> but am I right in that I can combine log file shipping with streaming
> replication, and achieve small replication delays plus the ability to
> recover after a longer period of network outage?

Yes. You might look at adjusting your wal_keep_segments to retain WAL for a period longer than expected network failures. This might require some storage planning/adjustments.
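For illustration, the setting being referred to is an ordinary postgresql.conf parameter on the primary; the value below is an arbitrary example and should be sized from your WAL generation rate and the longest outage you want to survive:

# postgresql.conf on the primary (illustrative value only)
wal_keep_segments = 1024    # keep at least 1024 segments in pg_wal (16 GB at the default 16 MB segment size)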
>> Is there a utility that is widely used for WAL file shipping, and
>> addresses these problems?
> rsync

Should I use rsync with the --remove-source-files option? Will that solve all problems with unreliable networks?
>>> Is there a utility that is widely used for WAL file shipping, and
>>> addresses these problems?
>> rsync
> Should I use rsync with the --remove-source-files option? Will that solve
> all problems with unreliable networks?

Another problem with rsync might be that it does not know whether a WAL file on the source side is still growing (e.g. being written), and it might start copying it before it is fully flushed to disk. I see this as a big problem, but I don't have experience with it.

My first idea to address this is to execute an archive command something like this:

test ! -f /pg_wal/%f && cp %p /volume/pg_wal_tmp/%f && mv /volume/pg_wal_tmp/%f /volume/pg_wal/%f

E.g. move the file into the archive dir only after it has been fully copied to the destination volume. But then we still have the problem of partially written files on the slave side.

I can come up with other ideas, but I do not want to reinvent the wheel. Can somebody please tell me how to deal with partially written WAL files?

Thanks,

Laszlo
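As a sketch only, the copy-then-rename idea above could be wrapped into a small archive_command script like the following. The script name and paths are made up for illustration, and it assumes the archive volume is locally mounted; the rename is only atomic because the temp directory and the archive directory sit on the same filesystem:

#!/bin/bash
# archive_wal.sh  -- hypothetical wrapper, called as: archive_command = 'archive_wal.sh %p %f'
SRC="$1"     # %p: path of the finished WAL segment under the data directory
FNAME="$2"   # %f: bare file name of the segment
ARCHIVE=/volume/pg_wal
TMP=/volume/pg_wal_tmp

[ -f "$ARCHIVE/$FNAME" ] && exit 1        # never overwrite an already archived segment
cp "$SRC" "$TMP/$FNAME" || exit 1         # write the data where readers never look...
exec mv "$TMP/$FNAME" "$ARCHIVE/$FNAME"   # ...then publish it with an atomic rename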
On Feb 28, 2018, at 8:46 AM, Nagy László Zsolt <gandalf@shopzeus.com> wrote:
>
> Another problem with rsync might be that it does not know if a WAL file
> on the source side is growing (e.g. being written) and it might start
> copy that before it is fully flushed to disk. I see this as a big
> problem, but I don't have experience.

You use rsync in the archive command, not by itself. Postgres does not try to copy the file until it is ready.

> E.g. move the file into the archive dir only after it has been fully
> copied to the destination volume. But then we still have the problem of
> partially written files on the slave side.

No, you do not have that problem at all.

Given an unreliable network, you probably do want to use --partial

--
Scott Ribe
https://www.linkedin.com/in/scottribe/
(303) 722-0567
Just use "-ac”; you want -c option to ensure no data corruption during the transfer. Do not delete the file; let Postgres manage that.
Here is a snippet from I script I use for archiving. You also want to make your script returns failure or success correctly.
# SSH Command and options
SSH_CMD="ssh -o ServerAliveInterval=20 $ARCH_SERVER"
STS=3
OUTPUT=$(rsync -ac --rsync-path="mkdir -p $ARCH_DIR && rsync" $XLOGFILE $ARCH_SERVER:$ARCH_DIR/$WALFILE)
if [ $? -eq 0 ]; then
    STS=0
fi
exit $STS
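For readers wiring something like this up, here is a hedged example of how the snippet's variables might get populated; the script path and host name are assumptions, not part of the original snippet:

# postgresql.conf (illustrative)
archive_mode = on
archive_command = '/usr/local/bin/archive_wal.sh %p %f'

# inside such a script, the variables used above would then be set roughly as:
XLOGFILE="$1"                     # %p: path to the segment in pg_wal
WALFILE="$2"                      # %f: bare segment file name
ARCH_SERVER=standby.example.com   # archive host
ARCH_DIR=/transfer/wal            # archive directory on that host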
>> Another problem with rsync might be that it does not know if a WAL file
>> on the source side is growing (e.g. being written) and it might start
>> copy that before it is fully flushed to disk. I see this as a big
>> problem, but I don't have experience.
> You use rsync in the archive command, not by itself. Postgres does not try to copy the file until it is ready.

Well, that will be a problem, because postgres is running in an isolated docker container. But that is a totally different problem. I can see that if rsync is used in the archive command, then postgres will retry unsuccessful transfers on the source side.

>> E.g. move the file into the archive dir only after it has been fully
>> copied to the destination volume. But then we still have the problem of
>> partially written files on the slave side.
> No, you do not have that problem at all.

Can you please explain how postgres on the replication slave can determine whether the file in the archive dir is completely flushed to disk?

> Given an unreliable network, you probably do want to use --partial

--partial can resume interrupted transfers. But on the slave side, there will be partially transferred files in the archive dir, maybe for a long time. As far as I understand, the restore command will not (and cannot) examine whether the source file has been completely copied. It will just be called, and if there is a file with the expected name, then it will copy it into pg_wal. Maybe I'm missing a key concept, but I still don't see how it works.

L
> Just use "-ac"; you want the -c option to ensure no data corruption during the transfer. Do not delete the file; let Postgres manage that.
>
> Here is a snippet from a script I use for archiving. You also want to make sure your script returns failure or success correctly.

Thanks for the script. So I need to use this on the master side in archive_command. It ensures that postgres will retry the transfer of partially transferred files until it succeeds. Unfortunately, I cannot use this in my docker container, which is isolated and does not contain SSH or rsync. But I get the idea, and I can come up with a simple script that runs on the host machine and transfers these files reliably to the slave side.
But I still don't understand what happens on the slave side when the slave tries to use a partially transferred WAL file. I have this recovery.conf on the slave side:
standby_mode='on'
primary_conninfo='host=postgres-master port=5432 user=not_telling password=not_telling'
trigger_file='/backup/trigger'
restore_command = 'cp /transfer/%f %p'
archive_cleanup_command = 'pg_archivecleanup /transfer %r'
So what happens when the slave postgres executes restore_command on a WAL file that was transferred partially? It cannot test the file for completeness before it has been copied to pg_wal. The restore command itself also cannot tell whether the file is complete or not. What does PostgreSQL do when it sees an incomplete file in pg_wal? Does it detect that the file was incomplete and try to execute the restore_command again? When? How often?

It seems inefficient to execute the restore_command multiple times just to find out that the file is not yet complete, but if this is how it works, then it is fine with me. But does it really work that way? I don't see it documented (or maybe I'm looking in the wrong place).
Sorry for the many questions. I need to understand every detail of this before we do it in production.
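For what it's worth, a restore_command does not have to be a bare cp; a wrapper can refuse to hand over a file that is obviously too short, so PostgreSQL simply fails that attempt and retries later. The script below is only a sketch under assumptions: the name is hypothetical, and the expected size must match your cluster's WAL segment size (16 MB by default; the later examples in this thread use 64 MB segments):

#!/bin/bash
# restore_wal.sh  -- hypothetical wrapper, called as: restore_command = 'restore_wal.sh /transfer/%f %p'
SRC="$1"; DST="$2"
EXPECTED=16777216        # assumed 16 MB WAL segment size; adjust if your cluster uses a different size
[ -f "$SRC" ] || exit 1                                            # not shipped yet: fail, PostgreSQL retries later
case "$SRC" in
    *.history|*.backup) ;;                                         # timeline/backup files are small; skip the size check
    *)
        SIZE=$(stat -c %s "$SRC" 2>/dev/null || stat -f %z "$SRC") # GNU stat, falling back to BSD stat
        [ "$SIZE" -eq "$EXPECTED" ] || exit 1                      # still partial: refuse to hand over a short file
        ;;
esac
exec cp "$SRC" "$DST"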
On Feb 28, 2018, at 10:05 AM, Rui DeSousa <rui.desousa@icloud.com> wrote:
>
> Just use "-ac"; you want the -c option to ensure no data corruption during the transfer. Do not delete the file; let Postgres manage that.

-c has nothing to do with data corruption during transfer

--
Scott Ribe
https://www.linkedin.com/in/scottribe/
(303) 722-0567
Easiest thing to do is to keep the replica from seeing the file until it’s successfully transferred; i.e.
1. Transfer the file successfully first with a temporary extension: $ARCH_DIR/$WALFILE.swap
2. If successfully transferred then move it into place:
mv $ARCH_DIR/$WALFILE.swap $ARCH_DIR/$WALFILE
3. Return status success/failure code back to primary
There seem to be 2 fundamental misunderstandings here:

1) That other processes cannot see data written to a file until it is flushed to disk; this is not true; while file data is still in file cache, it is visible to other processes.

2) That rsync writes the file on the destination directly; it does not; it writes into a temporary file and renames that file when it is complete.

> On Feb 28, 2018, at 10:53 AM, Nagy László Zsolt <gandalf@shopzeus.com> wrote:
>
>> Just use "-ac"; you want the -c option to ensure no data corruption during the transfer. Do not delete the file; let Postgres manage that.
>>
>> Here is a snippet from a script I use for archiving. You also want to make sure your script returns failure or success correctly.
>>
>> # SSH Command and options
>> SSH_CMD="ssh -o ServerAliveInterval=20 $ARCH_SERVER"
>> STS=3
>>
>> OUTPUT=$(rsync -ac --rsync-path="mkdir -p $ARCH_DIR && rsync" $XLOGFILE $ARCH_SERVER:$ARCH_DIR/$WALFILE)
>> if [ $? == 0 ]; then
>> STS=0
>> fi
>>
>> exit $STS

--
Scott Ribe
https://www.linkedin.com/in/scottribe/
(303) 722-0567
Sure it does… what happens if the archive server gets full? There is a situation that can fool rsync into thinking it was successfully transferred, as rsync only relies on filesystem metadata to verify the file is different. The "-c" option will validate the file using a checksum and not just rely on filesystem metadata.

Trust me on this one; I've seen this happen.

> On Feb 28, 2018, at 1:05 PM, scott ribe <scott_ribe@elevated-dev.com> wrote:
>
> On Feb 28, 2018, at 10:05 AM, Rui DeSousa <rui.desousa@icloud.com> wrote:
>>
>> Just use "-ac"; you want the -c option to ensure no data corruption during the transfer. Do not delete the file; let Postgres manage that.
>
> -c has nothing to do with data corruption during transfer
>
> --
> Scott Ribe
> https://www.linkedin.com/in/scottribe/
> (303) 722-0567
On Feb 28, 2018, at 1:12 PM, scott ribe <scott_ribe@elevated-dev.com> wrote:
> There seem to be 2 fundamental misunderstandings here:
>
> 1) That other processes cannot see data written to a file until it is flushed to disk; this is not true; while file data is still in file cache, it is visible to other processes.
>
> 2) That rsync writes the file on the destination directly; it does not; it writes into a temporary file and renames that file when it is complete.

While you're correct, I never made either of those assumptions.
On Wed, 28 Feb 2018 13:15:48 -0500 Rui DeSousa <rui.desousa@icloud.com> wrote:
> Sure it does… what happens if the archive server gets full? There is
> a situation that can fool rsync into thinking it was successfully
> transferred as rsync only relies on filesystem metadata to verify the
> file is different.

That's incorrect. The original transfer will fail and rsync will remove the temporary file. It doesn't rename the file to the final destination until and unless the transfer is successful.

"-c" is only used for files that might change multiple times per second; since most file systems have only 1-second granularity in file timestamps, you can't rely on a file being identical if the size and timestamps are the same.

> The "-c" option will validate the file using a checksum and not just
> rely on just filesystem metadata.

That's incorrect. "-c" applies only when rsync looks at existing files to see if they differ. It does not affect what happens when rsync actually copies the file data over; that's always verified with a checksum.

Regards,

Dianne.
Really? I think you really need to think about it. You are dealing with a race condition. While your thought process might seem to work and it will appear to work, in the end it will fail — like I said, I've seen it fail first hand.

You are dealing with a race condition involving filesystem metadata. If you choose not to use the "-c" option, then all I can tell you is that you have been warned.

> On Feb 28, 2018, at 1:26 PM, Dianne Skoll <dfs@roaringpenguin.com> wrote:
>
> On Wed, 28 Feb 2018 13:15:48 -0500
> Rui DeSousa <rui.desousa@icloud.com> wrote:
>
>> Sure it does… what happens if the archive server gets full? There is
>> a situation that can fool rsync into thinking it was successfully
>> transferred as rsync only relies on filesystem metadata to verify the
>> file is different.
>
> That's incorrect. The original transfer will fail and rsync will remove
> the temporary file. It doesn't rename the file to the final destination
> until and unless the transfer is successful.
>
> "-c" is only used for files that might change multiple times per second;
> since most file systems have only 1-second granularity in file timestamps,
> you can't rely on a file being identical if the size and timestamps are the
> same.
>
>> The "-c" option will validate the file using a checksum and not just
>> rely on just filesystem metadata.
>
> That's incorrect. "-c" applies only when rsync looks at existing files
> to see if they differ. It does not affect what happens when rsync actually
> copies the file data over; that's always verified with a checksum.
>
> Regards,
>
> Dianne.
On Wed, 28 Feb 2018 13:49:33 -0500 Rui DeSousa <rui.desousa@icloud.com> wrote:
> Really? I think you really need to think about it. You are dealing
> with a race condition.

Not at all, assuming rsync is being called by PostgreSQL's archive_command.

> You are dealing with a race condition involving filesystem metadata.

Nope. Not true. What are the things that are racing?

If you rsync data to make a base backup, then yes, use -c... but not because of any race condition. Instead, it's because of the coarse filesystem timestamp granularity.

Regards,

Dianne.
Number 1 is irrelevant, because files are sent from the master and received by the slave: two different computers that don't share the same cache.

Number 2 blows away my obscurity completely; I just didn't know that rsync works that way. It must be documented somewhere in the (very long) rsync manual. I just opened it and looked specifically for this behaviour, but I cannot find it. There is a --temp-dir option that suggests that data is written to temporary files first. But that is only a suggestion. I don't see anything explicit about writing data to temporary files and renaming them once they are complete. Although it seems logical and I believe you, I did not want to make such assumptions either.

> There seem to be 2 fundamental misunderstandings here:
>
> 1) That other processes cannot see data written to a file until it is flushed to disk; this is not true; while file data is still in file cache, it is visible to other processes.
>
> 2) That rsync writes the file on the destination directly; it does not; it writes into a temporary file and renames that file when it is complete.
>
> While you're correct; I never made either of those assumptions
So, problem solved in theory. The next step is to do tests by simulating network outages.
I also have a proposal: let's change the example in the PostgreSQL documentation! The example archive_command presented in the docs contains a simple cp command that does not write data to temporary files. Anyone who tries to use that example on a production server will shoot himself in the foot. At least we should have a note there, telling the reader that it is the user's responsibility to make sure that only complete WAL files appear in the slave's archive directory, and that the cp command is just a silly example that should never be used on a production server.
Thank you for your help!
Laszlo
On Feb 28, 2018, at 12:03 PM, Nagy László Zsolt <gandalf@shopzeus.com> wrote:
>
> There is a --temp-dir option that suggests that data is written to temporary files first. But that is only a suggestion. I don't see anything explicit about writing data to temporary files and renaming them once they are complete.

Seems hard to me to make it more clear:

    -T, --temp-dir=DIR
        This option instructs rsync to use DIR as a scratch directory
        when creating temporary copies of the files transferred on the
        receiving side. The default behavior is to create each temporary
        file in the same directory as the associated destination file.

And:

    --delay-updates
        This option puts the temporary file from each updated file into
        a holding directory until the end of the transfer, at which time
        all the files are renamed into place in rapid succession. This
        attempts to make the updating of the files a little more atomic.
        By default the files are placed into a directory named ".~tmp~"
        in each file's destination directory, but if you've specified
        the --partial-dir option, that directory will be used instead.
        See the comments in the --partial-dir section for a discussion
        of how this ".~tmp~" dir will be excluded from the transfer, and
        what you can do if you want rsync to cleanup old ".~tmp~" dirs
        that might be lying around. Conflicts with --inplace and
        --append.

--
Scott Ribe
https://www.linkedin.com/in/scottribe/
(303) 722-0567
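Putting the options quoted above together, a possible invocation for an interruption-prone link might look like this (file name and host are placeholders, not from the thread). Note that it is --partial-dir, rather than bare --partial, that keeps resumable partial data out of the final file name: the data lands in the partial directory first and is only renamed into place once the transfer completes.

rsync -a --partial-dir=.rsync-partial \
    /archive/00000001000000000000002A standby.example.com:/transfer/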
On 2018-02-28 19:11, Rui DeSousa wrote:
> Easiest thing to do is to keep the replica from seeing the file until
> it's successfully transferred; i.e.
>
> 1. Transfer the file successfully first with a temporary extension:
>    $ARCH_DIR/$WALFILE.swap
> 2. If successfully transferred then move it into place:
>    mv $ARCH_DIR/$WALFILE.swap $ARCH_DIR/$WALFILE

This works ONLY if the directory is mounted on the master, and you can execute "mv" on the master (as part of the archive_command) to rename the file that is stored on the slave. But the original question specified that mounting is not an option.
You are dealing with a race condition. Postgres will continually attempt to archive the same file over and over until rsync foolishly returns success.

You are making the assumption that rsync removed the file? What if it can't? If you're dealing with a logged filesystem and it can't log the metafile changes, what happens then? What happens when rsync core dumps?

Why would you not use "-c"? I have seen this fail without the "-c" option…

> On Feb 28, 2018, at 1:52 PM, Dianne Skoll <dfs@roaringpenguin.com> wrote:
>
> On Wed, 28 Feb 2018 13:49:33 -0500
> Rui DeSousa <rui.desousa@icloud.com> wrote:
>
>> Really? I think you really need to think about it. You are dealing
>> with a race condition.
>
> Not at all, assuming rsync is being called by PostgreSQL's archive_command.
>
>> You are dealing with a race condition involving filesystem metadata.
>
> Nope. Not true. What are the things that are racing?
>
> If you rsync data to make a base backup, then yes, use -c... but not because
> of any race condition. Instead, it's because of the coarse filesystem
> timestamp granularity.
>
> Regards,
>
> Dianne.
So, I decided to run a test. It turns out we are both wrong. My test script is as follows:

#======================================================================
#!/bin/bash

# Remote machine
MACH=dianne.skoll.ca

# Create a 10MB pseudo-random file
echo "`date` Creating 10MB file"
FILE=pseudo-random
dd if=/dev/urandom bs=1k count=10k of=$FILE 2>/dev/null

# Rsync it over at 500kB per second in the background
echo "`date` Starting rsync"
rsync -c --verbose --bwlimit=500k $FILE $MACH:/tmp &
RSYNC_PID=$!
echo "`date` rsync PID=$RSYNC_PID; sleeping for 3 seconds"

# Sleep for 3 seconds
sleep 3

echo "`date` Stopping rsync"
# Stop the rsync
kill -STOP $RSYNC_PID

echo "`date` Changing source file"
# Change the source file
dd if=/dev/urandom bs=1k count=10k of=$FILE 2>/dev/null

echo "`date` Continuing rsync"
# Restart the rsync and bring it into the foreground
kill -CONT $RSYNC_PID
wait

echo "`date` Rsync done; comparing files"
ssh $MACH "cd /tmp && sha1sum $FILE" | sha1sum -c -
#======================================================================

Running it produces this output:

Wed Feb 28 14:14:32 EST 2018 Creating 10MB file
Wed Feb 28 14:14:32 EST 2018 Starting rsync
Wed Feb 28 14:14:32 EST 2018 rsync PID=24456; sleeping for 3 seconds
pseudo-random
Wed Feb 28 14:14:35 EST 2018 Stopping rsync
Wed Feb 28 14:14:35 EST 2018 Changing source file
Wed Feb 28 14:14:35 EST 2018 Continuing rsync

sent 10,488,419 bytes  received 19,505 bytes  488,740.65 bytes/sec
total size is 10,485,760  speedup is 1.00
Wed Feb 28 14:14:53 EST 2018 Rsync done; comparing files
pseudo-random: FAILED
sha1sum: WARNING: 1 computed checksum did NOT match

So even *with* the -c flag, rsync didn't notice the file had been changed while it was running. Interesting.

Regards,

Dianne.
On Wed, 28 Feb 2018 14:12:39 -0500 Rui DeSousa <rui.desousa@icloud.com> wrote:
> You are dealing with a race condition. Postgres will continually
> attempt to archive the same file over and over until rsync foolishly
> returns success.

What do you mean by "rsync foolishly returns success"?

> You are making the assumption that rsync removed the file?

Rsync does not create the file with the final name until it has successfully transferred it completely. It copies it to a temporary file and then, if (and only if) the transfer succeeded, it atomically renames the temporary file to the final filename.

The only failure mode if metadata updates are lost is that the file will appear not to exist when it really should, not the other way around. You can never, ever have the file existing with the proper filename until a transfer has succeeded.

However, see my other post about a test. The -c option doesn't actually help if a file changes while rsync is running.

Regards,

Dianne.
> By default the files are placed into a directory named ".~tmp~"
> in each file's destination directory, but if you've specified
> the --partial-dir option, that directory will be used instead.
> See the comments in the --partial-dir section for a discussion
> of how this ".~tmp~" dir will be excluded from the transfer, and
> what you can do if you want rsync to cleanup old ".~tmp~" dirs
> that might be lying around. Conflicts with --inplace and
> --append.

I stand corrected. It is there. But still: this is a general behaviour of rsync that is independent of most command line options. And yet, "the temporary file" is just an unexplained term that appears in the middle of the description of a (rarely used) option, and the "renamed only after the transfer has completed" part is completely missing. That said, it is just an excuse on my side. I should have figured it out. :-)
On Wed, 28 Feb 2018 13:49:33 -0500 Rui DeSousa <rui.desousa@icloud.com> wrote:
> Really? I think you really need to think about it. You are dealing
> with a race condition.

You are not correct.

1) rsync copies the source file to a temporary file on the destination.

2) If and only if step (1) is successful, rsync renames the temporary file to the final filename.

3) If and only if step (2) is successful, rsync sets the timestamp on the destination file to match that of the source file, assuming the --archive or similar command-line option was given.

There is no possible sequence of events that could give you a destination file with the same size and timestamp as the source file and yet be incorrect (just so long as the source file has stayed the same.)

Regards,

Dianne.
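The three steps described above are essentially the classic write/rename/stamp pattern. The shell below only illustrates that ordering (it is not rsync's actual implementation, and the file names are made up):

# copy into a hidden temp name, publish it atomically, then copy the source timestamp
cp 000000010000000000000042 /archive/.000000010000000000000042.tmp && \
mv /archive/.000000010000000000000042.tmp /archive/000000010000000000000042 && \
touch -r 000000010000000000000042 /archive/000000010000000000000042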
Nagy László Zsolt wrote:
> Let's suppose that a replication master is writing a WAL segment file
> into a directory. That directory is mounted on the replication slave. Is
> it possible that the slave will try to read a WAL segment file that is
> not yet fully flushed to disk? I did not find any requirement in the
> official documentation about this. How does it work? Do I have to copy
> segments to temp files, and rename them when they are fully flushed to
> disk? Or is it okay to have half complete files in the archive dir for a
> while?

I suppose you are talking about "archive_command" here.

If the file restored with "restore_command" is too small, the operation fails, and you get a DEBUG1 message:

    archive file "..." has wrong size: ... instead of ...

So nothing can go wrong there.

> And finally: if I also enable streaming replication, then it seems that
> log file shipping is not needed at all. If I omit archive_command and
> restore_command from the configs, and setup the replication slots and
> primary_conninfo only, then it seems to be working just fine. But when
> the network goes down for a while, then the slave goes out of sync and
> it cannot recover. It was not clear for me from the documentation, but
> am I right in that I can combine log file shipping with streaming
> replication, and achieve small replication delays plus the ability to
> recover after a longer period of network outage?

If you use a replication slot, the standby will never get out of sync, because the primary will retain all WAL that the standby has not received yet.

Streaming replication together with archive recovery is only useful if you are *not* using replication slots.

Yours,

Laurenz Albe
>> Do I have to copy
>> segments to temp files, and rename them when they are fully flushed to
>> disk? Or is it okay to have half complete files in the archive dir for a
>> while?
> I suppose you are talking about "archive_command" here.
>
> If the file restored with "restore_command" is too small,
> the operation fails, and you get a DEBUG1 message:
>
>     archive file "..." has wrong size: ... instead of ...
>
> So nothing can go wrong there.

Nothing can go wrong? Does that mean that PostgreSQL will re-execute the restore_command if the file was too small? If it won't retry the restore_command, then everything goes wrong. It might be documented somewhere, but apparently I'm out of luck with the documentation.

>> And finally: if I also enable streaming replication, then it seems that
>> log file shipping is not needed at all. If I omit archive_command and
>> restore_command from the configs, and setup the replication slots and
>> primary_conninfo only, then it seems to be working just fine. But when
>> the network goes down for a while, then the slave goes out of sync and
>> it cannot recover. It was not clear for me from the documentation, but
>> am I right in that I can combine log file shipping with streaming
>> replication, and achieve small replication delays plus the ability to
>> recover after a longer period of network outage?
> If you use a replication slot, the standby will never get out of sync
> because the primary will retain all WAL that the standby has not
> received yet.
>
> Streaming replication together with archive recovery is only useful
> if you are *not* using replication slots.

So you are saying that if I use replication slots, then I can completely forget about manual WAL file shipping. There is one thing in the docs that contradicts the above statement. This is from https://www.postgresql.org/docs/10/static/warm-standby.html#STREAMING-REPLICATION

> If you use streaming replication without file-based continuous
> archiving, the server might recycle old WAL segments before the
> standby has received them. If this occurs, the standby will need to be
> reinitialized from a new base backup.

In other words, using file based archive recovery plus streaming replication at the same time can be useful, because it allows you to store WAL segments on a disk that is independent of the master, and it makes recovery possible after a longer period of time. Increasing wal_keep_segments may help the replication slave to recover after a longer time, but it is more risky. The segments are stored on the master, and they don't provide protection against disk failures (on the master).

Cheers,

Laszlo
Nagy László Zsolt wrote:
>>> Do I have to copy
>>> segments to temp files, and rename them when they are fully flushed to
>>> disk? Or is it okay to have half complete files in the archive dir for a
>>> while?
>>
>> I suppose you are talking about "archive_command" here.
>>
>> If the file restored with "restore_command" is too small,
>> the operation fails, and you get a DEBUG1 message:
>>
>>     archive file "..." has wrong size: ... instead of ...
>>
>> So nothing can go wrong there.
>
> Nothing can go wrong? Does that mean that PostgreSQL will re-execute the
> restore_command if the file was too small? If it won't retry the
> restore_command, then everything goes wrong. It might be documented
> somewhere, but apparently I'm out of luck with the documentation.

To verify that restore_command won't accept a file that's too short, read the code in backend/access/transam/xlogarchive.c

The behavior for streaming replication with a WAL archive is documented here:
https://www.postgresql.org/docs/current/static/warm-standby.html#STANDBY-SERVER-OPERATION

"At startup, the standby begins by restoring all WAL available in the archive location, calling restore_command. Once it reaches the end of WAL available there and restore_command fails, it tries to restore any WAL available in the pg_wal directory. If that fails, and streaming replication has been configured, the standby tries to connect to the primary server and start streaming WAL from the last valid record found in archive or pg_wal. If that fails or streaming replication is not configured, or if the connection is later disconnected, the standby goes back to step 1 and tries to restore the file from the archive again. This loop of retries from the archive, pg_wal, and via streaming replication goes on until the server is stopped or failover is triggered by a trigger file."

>>> And finally: if I also enable streaming replication, then it seems that
>>> log file shipping is not needed at all. If I omit archive_command and
>>> restore_command from the configs, and setup the replication slots and
>>> primary_conninfo only, then it seems to be working just fine. But when
>>> the network goes down for a while, then the slave goes out of sync and
>>> it cannot recover. It was not clear for me from the documentation, but
>>> am I right in that I can combine log file shipping with streaming
>>> replication, and achieve small replication delays plus the ability to
>>> recover after a longer period of network outage?
>>
>> If you use a replication slot, the standby will never get out of sync
>> because the primary will retain all WAL that the standby has not
>> received yet.
>>
>> Streaming replication together with archive recovery is only useful
>> if you are *not* using replication slots.
>
> So you are saying that if I use replication slots, then I can completely
> forget about manual WAL file shipping.

Precisely.

> There is one thing in the docs
> that contradicts the above statement.
>
> This is from
> https://www.postgresql.org/docs/10/static/warm-standby.html#STREAMING-REPLICATION
>
>> If you use streaming replication without file-based continuous
>> archiving, the server might recycle old WAL segments before the
>> standby has received them. If this occurs, the standby will need to be
>> reinitialized from a new base backup.

This only talks about the case where you do not use replication slots.

Read https://www.postgresql.org/docs/current/static/warm-standby.html#STREAMING-REPLICATION-SLOTS:

"Replication slots provide an automated way to ensure that the master does not remove WAL segments until they have been received by all standbys, and that the master does not remove rows which could cause a recovery conflict even when the standby is disconnected."

The key word is "ensure".

Yours,

Laurenz Albe
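For readers who want to see the slot-based setup concretely, a minimal example follows. The slot, user, and password names are placeholders, and this reflects PostgreSQL 10, the version used in this thread (it also requires max_replication_slots > 0 on the primary):

-- on the primary:
SELECT pg_create_physical_replication_slot('standby1');

# recovery.conf on the standby:
standby_mode = 'on'
primary_conninfo = 'host=postgres-master port=5432 user=replication_user password=secret'
primary_slot_name = 'standby1'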
I've tested this and it seems that there is still a bug in rsync (rsync version 3.1.2 protocol version 31). I used a 1GB archive filesystem to allow for an out of space test case. Not sure of the actual cause, as it seems to work a few times; however, it then fails leaving a truncated file and returning a success code.

Example: 00000001000000590000003E - failed 4 times, and on the fifth try rsync returned success and left a truncated file.

When there is actually no space left, rsync fails immediately and never returns a success code; i.e. 00000001000000590000003F. When freeing up space, archiving resumes again. It seems that if rsync is already syncing when the filesystem fills up, then there is a high risk the bug will occur; i.e. 000000010000005A00000001 is also truncated and with an rsync success code.

Since rsync is returning success on a failed sync, even the "-c" option will not help here.

Archive script critical code:

OUTPUT=$(rsync -ac $XLOGFILE $ARCH_SERVER:$ARCH_DIR/$WALFILE)
if [ $? == 0 ]; then
    STS=0
    echo "Success: $WALFILE" >> /tmp/waltest.log
else
    echo "Failed: $WALFILE" >> /tmp/waltest.log
fi

exit $STS

Archive directory (note: using 64MB WALs):

[postgres@hades ~/arch/dbc1/wal]$ ls -al
total 1044351
drwxr-xr-x 2 postgres postgres       57 Feb 28 19:46 .
drwxr-xr-x 3 postgres postgres        3 Feb 28 17:33 ..
-rw------- 1 postgres postgres 67108864 Feb 28 17:33 00000001000000590000000B
-rw------- 1 postgres postgres 67108864 Feb 28 17:33 00000001000000590000000C
-rw------- 1 postgres postgres 67108864 Feb 28 17:33 00000001000000590000000D
-rw------- 1 postgres postgres 67108864 Feb 28 17:33 00000001000000590000000E
-rw------- 1 postgres postgres 67108864 Feb 28 17:33 00000001000000590000000F
-rw------- 1 postgres postgres 67108864 Feb 28 17:33 000000010000005900000010
-rw------- 1 postgres postgres 67108864 Feb 28 17:33 000000010000005900000011
-rw------- 1 postgres postgres 67108864 Feb 28 17:33 000000010000005900000012
-rw------- 1 postgres postgres 67108864 Feb 28 17:33 000000010000005900000013
-rw------- 1 postgres postgres 67108864 Feb 28 17:33 000000010000005900000014
-rw------- 1 postgres postgres 67108864 Feb 28 17:33 000000010000005900000015
-rw------- 1 postgres postgres 67108864 Feb 28 17:33 000000010000005900000016
-rw------- 1 postgres postgres 67108864 Feb 28 17:33 000000010000005900000017
-rw------- 1 postgres postgres 67108864 Feb 28 17:33 000000010000005900000018
-rw------- 1 postgres postgres 67108864 Feb 28 17:33 000000010000005900000019
-rw------- 1 postgres postgres 67108864 Feb 28 17:33 00000001000000590000001A
-rw------- 1 postgres postgres 67108864 Feb 28 17:33 00000001000000590000001B
-rw------- 1 postgres postgres 67108864 Feb 28 17:33 00000001000000590000001C
-rw------- 1 postgres postgres 67108864 Feb 28 17:33 00000001000000590000001D
-rw------- 1 postgres postgres 67108864 Feb 28 17:33 00000001000000590000001E
-rw------- 1 postgres postgres 67108864 Feb 28 17:33 00000001000000590000001F
-rw------- 1 postgres postgres 67108864 Feb 28 17:33 000000010000005900000020
-rw------- 1 postgres postgres 67108864 Feb 28 17:33 000000010000005900000021
-rw------- 1 postgres postgres 67108864 Feb 28 17:33 000000010000005900000022
-rw------- 1 postgres postgres 67108864 Feb 28 17:33 000000010000005900000023
-rw------- 1 postgres postgres 67108864 Feb 28 17:33 000000010000005900000024
-rw------- 1 postgres postgres 67108864 Feb 28 17:33 000000010000005900000025
-rw------- 1 postgres postgres 67108864 Feb 28 17:33 000000010000005900000026
-rw------- 1 postgres postgres 67108864 Feb 28 17:33 000000010000005900000027
-rw------- 1 postgres postgres 67108864 Feb 28 17:33 000000010000005900000028
-rw------- 1 postgres postgres 67108864 Feb 28 17:33 000000010000005900000029
-rw------- 1 postgres postgres 67108864 Feb 28 17:33 00000001000000590000002A
-rw------- 1 postgres postgres 67108864 Feb 28 17:33 00000001000000590000002B
-rw------- 1 postgres postgres 67108864 Feb 28 17:33 00000001000000590000002C
-rw------- 1 postgres postgres 67108864 Feb 28 17:33 00000001000000590000002D
-rw------- 1 postgres postgres 67108864 Feb 28 17:33 00000001000000590000002E
-rw------- 1 postgres postgres 67108864 Feb 28 17:33 00000001000000590000002F
-rw------- 1 postgres postgres 67108864 Feb 28 17:33 000000010000005900000030
-rw------- 1 postgres postgres 67108864 Feb 28 17:33 000000010000005900000031
-rw------- 1 postgres postgres 67108864 Feb 28 17:33 000000010000005900000032
-rw------- 1 postgres postgres 67108864 Feb 28 17:33 000000010000005900000033
-rw------- 1 postgres postgres 67108864 Feb 28 17:33 000000010000005900000034
-rw------- 1 postgres postgres 67108864 Feb 28 17:33 000000010000005900000035
-rw------- 1 postgres postgres 67108864 Feb 28 17:33 000000010000005900000036
-rw------- 1 postgres postgres 67108864 Feb 28 17:33 000000010000005900000037
-rw------- 1 postgres postgres 67108864 Feb 28 17:33 000000010000005900000038
-rw------- 1 postgres postgres 67108864 Feb 28 17:33 000000010000005900000039
-rw------- 1 postgres postgres 67108864 Feb 28 17:33 00000001000000590000003A
-rw------- 1 postgres postgres 67108864 Feb 28 17:33 00000001000000590000003B
-rw------- 1 postgres postgres 67108864 Feb 28 17:33 00000001000000590000003C
-rw------- 1 postgres postgres 67108864 Feb 28 17:33 00000001000000590000003D
-rw------- 1 postgres postgres  3670016 Feb 28 17:33 00000001000000590000003E
-rw------- 1 postgres postgres 67108864 Feb 28 17:33 00000001000000590000003F
-rw------- 1 postgres postgres 67108864 Feb 28 17:33 000000010000005A00000000
-rw------- 1 postgres postgres 12713984 Feb 28 17:33 000000010000005A00000001

Success/Failure log:

[postgres@hades ~/arch/dbc1/wal]$ cat /tmp/waltest.log
Success: 000000010000005900000008
Success: 000000010000005900000009
Success: 00000001000000590000000A
Success: 00000001000000590000000B
Success: 00000001000000590000000C
Success: 00000001000000590000000D
Success: 00000001000000590000000E
Success: 00000001000000590000000F
Success: 000000010000005900000010
Success: 000000010000005900000011
Success: 000000010000005900000012
Success: 000000010000005900000013
Success: 000000010000005900000014
Success: 000000010000005900000015
Success: 000000010000005900000016
Success: 000000010000005900000017
Success: 000000010000005900000018
Success: 000000010000005900000019
Success: 00000001000000590000001A
Success: 00000001000000590000001B
Success: 00000001000000590000001C
Success: 00000001000000590000001D
Success: 00000001000000590000001E
Success: 00000001000000590000001F
Success: 000000010000005900000020
Success: 000000010000005900000021
Success: 000000010000005900000022
Success: 000000010000005900000023
Success: 000000010000005900000024
Success: 000000010000005900000025
Success: 000000010000005900000026
Success: 000000010000005900000027
Success: 000000010000005900000028
Success: 000000010000005900000029
Success: 00000001000000590000002A
Success: 00000001000000590000002B
Success: 00000001000000590000002C
Success: 00000001000000590000002D
Success: 00000001000000590000002E
Success: 00000001000000590000002F
Success: 000000010000005900000030
Success: 000000010000005900000031
Success: 000000010000005900000032
Success: 000000010000005900000033
Success: 000000010000005900000034
Success: 000000010000005900000035
Success: 000000010000005900000036
Success: 000000010000005900000037
Success: 000000010000005900000038
Success: 000000010000005900000039
Success: 00000001000000590000003A
Success: 00000001000000590000003B
Success: 00000001000000590000003C
Success: 00000001000000590000003D
Failed: 00000001000000590000003E
Failed: 00000001000000590000003E
Failed: 00000001000000590000003E
Failed: 00000001000000590000003E
Success: 00000001000000590000003E
Failed: 00000001000000590000003F
Failed: 00000001000000590000003F
Failed: 00000001000000590000003F
Failed: 00000001000000590000003F
Failed: 00000001000000590000003F
Failed: 00000001000000590000003F
Failed: 00000001000000590000003F
Failed: 00000001000000590000003F
Failed: 00000001000000590000003F
Failed: 00000001000000590000003F
Failed: 00000001000000590000003F
Failed: 00000001000000590000003F
Failed: 00000001000000590000003F
Failed: 00000001000000590000003F
Failed: 00000001000000590000003F
Failed: 00000001000000590000003F
Failed: 00000001000000590000003F
Failed: 00000001000000590000003F
Success: 00000001000000590000003F
Success: 000000010000005A00000000
Failed: 000000010000005A00000001
Failed: 000000010000005A00000001
Failed: 000000010000005A00000001
Failed: 000000010000005A00000001
Failed: 000000010000005A00000001
Failed: 000000010000005A00000001
Failed: 000000010000005A00000001
Success: 000000010000005A00000001
Failed: 000000010000005A00000002
Failed: 000000010000005A00000002
Failed: 000000010000005A00000002
Failed: 000000010000005A00000002
Failed: 000000010000005A00000002
Failed: 000000010000005A00000002
Failed: 000000010000005A00000002
Failed: 000000010000005A00000002
Failed: 000000010000005A00000002
Wow, that's good sleuthing! Are you going to report the bug to the rsync maintainers?

You can work around it by not having your script trust rsync, but to do a sha1 hash on both ends after rsync exits. We do this in production because we once encountered a server with bad memory that was mysteriously corrupting the files; we ended up never removing the sha1-comparison code.

Regards,

Dianne.

On February 28, 2018 8:10:59 PM EST, Rui DeSousa <rui.desousa@icloud.com> wrote:
>
> I've tested this and it seems that there is still a bug in rsync
> (rsync version 3.1.2 protocol version 31). I used a 1GB archive
> filesystem to allow for an out of space test case. Not sure of the
> actual cause as it seems to work a few times; however, it then fails
> leaving a truncated file and returning a success code.
Hi

On 01/03/18 14:10, Rui DeSousa wrote:
> I've tested this and it seems that there is still a bug in rsync (rsync version 3.1.2 protocol version 31). I used a 1GB archive filesystem to allow for an out of space test case. Not sure of the actual cause as it seems to work a few times; however, it then fails leaving a truncated file and returning a success code.
>
> Archive script critical code:
>
> OUTPUT=$(rsync -ac $XLOGFILE $ARCH_SERVER:$ARCH_DIR/$WALFILE)
> if [ $? == 0 ]; then
>     STS=0
>     echo "Success: $WALFILE" >> /tmp/waltest.log
> else
>     echo "Failed: $WALFILE" >> /tmp/waltest.log
> fi
>
> exit $STS

Looking at the above snippet, $STS is not properly initialized in the failure section of the code... so you might be getting false 0 returns when rsync is actually setting some other value.

regards

Mark
STS is initialized as 3, and is only set to zero if rsync returns zero.

> On Feb 28, 2018, at 8:59 PM, Mark Kirkwood <mark.kirkwood@catalyst.net.nz> wrote:
>
> Hi
>
> On 01/03/18 14:10, Rui DeSousa wrote:
>> I've tested this and it seems that there is still a bug in rsync (rsync version 3.1.2 protocol version 31). I used a 1GB archive filesystem to allow for an out of space test case. Not sure of the actual cause as it seems to work a few times; however, it then fails leaving a truncated file and returning a success code.
>
> Looking at the above snippet, $STS is not properly initialized in the failure section of the code... so you might be getting false 0 returns when rsync is actually setting some other value.
>
> regards
>
> Mark
Hi

On 01/03/18 15:27, Rui DeSousa wrote:
> STS is initialized as 3, and is only set to zero if rsync returns zero.

Ok, so if that is set correctly, then another possibility is the archive filesystem (what is that BTW?) not handling out of space nicely/properly. I'm thinking this because doing some testing against a small ext4 filesystem finds rsync reliably returning 11 when trying to copy a file that will not fit (version 3.1.2).

regards

Mark
It looks like rsync is not handling the disk quota when actually syncing the file — Disc quota exceeded (69) vs no space left (28). It only fails if it can't rename the truncated temp file — and in some cases there is just enough space left to allow the rename to succeed, thus leaving a truncated WAL file.

I'm using ZFS, and have a 1GB quota on the given filesystem.

[postgres@hades ~/dbc1/pg_wal]$ rsync -a 000000010000005B00000035 ~/arch/dbc1/wal/000000010000005B00000035
rsync: rename "/usr/home/postgres/arch/dbc1/wal/.000000010000005B00000035.6bjO94" -> "000000010000005B00000035": Disc quota exceeded (69)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1200) [sender=3.1.2]
On 01/03/18 17:57, Rui DeSousa wrote:
> It looks like rsync is not handling the disk quota when actually syncing the file — Disc quota exceeded (69) vs no space left (28). It only fails if it can't rename the truncated temp file — and in some cases there is just enough space left to allow the rename to succeed, thus leaving a truncated WAL file.
>
> I'm using ZFS, and have a 1GB quota on the given filesystem.
>
> [postgres@hades ~/dbc1/pg_wal]$ rsync -a 000000010000005B00000035 ~/arch/dbc1/wal/000000010000005B00000035
> rsync: rename "/usr/home/postgres/arch/dbc1/wal/.000000010000005B00000035.6bjO94" -> "000000010000005B00000035": Disc quota exceeded (69)
> rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1200) [sender=3.1.2]

Interesting - so it could be rsync not handling the quota error sensibly - or perhaps more likely ZFS and/or quota bugs. If the latter, the solution would be not to use these until the devs have fixed them... :-)

regards

Mark
I can tell you this; it's not ZFS or its quotas. I know that Postgres has no problems with ZFS and quotas; I have seen Postgres development environments run out of quota space on ZFS and Postgres never got corrupted — it would just crash and end up in a crash/restore loop. Just increase the quota and Postgres recovers without issue, exiting the crash/restore loop; clean up the rogue space usage, and reduce the quota back down. It makes recovering from out of space issues very quick.
On Feb 28, 2018, at 10:03 PM, Mark Kirkwood <mark.kirkwood@catalyst.net.nz> wrote:
>
> Interesting - so it could be rsync not handling the quota error sensibly - or perhaps more likely ZFS and/or quota bugs. If the latter, the solution would be not to use these until the devs have fixed them... :-)

Or realize that if you have less disk space than needed for a consistent replica or archive, you cannot possibly have a consistent replica or archive. The false report of success is not good, but it's not the root problem.

--
Scott Ribe
https://www.linkedin.com/in/scottribe/
(303) 722-0567
On 01/03/18 18:18, Rui DeSousa wrote:
> I can tell you this; it's not ZFS or its quotas. I know that Postgres has no problems with ZFS and quotas; I have seen Postgres development environments run out of quota space on ZFS and Postgres never got corrupted — it would just crash and end up in a crash/restore loop. Just increase the quota and Postgres recovers without issue, exiting the crash/restore loop; clean up the rogue space usage, and reduce the quota back down. It makes recovering from out of space issues very quick.

With quotas enabled on the same ext4 filesystem, I reliably get 'quota exceeded' and an 11 return code from rsync. This is on Ubuntu 17.10 (not sure what you are on)... but to me it points the finger at a ZFS bug.

Also, as has just been pointed out - typically people enable monitoring and provision so as to specifically avoid the weird edge cases for out of disk/quota etc.

regards

Mark
On Mar 1, 2018, at 12:21 AM, scott ribe <scott_ribe@elevated-dev.com> wrote:
> The false report of success is not good, but it's not the root problem.
A false success is a problem; especially in this use case, as the source WAL file will be deleted by Postgres before it was truly successful. While monitoring is nice to avoid the issue, it is not a fix for the issue.
I personally cannot recommend the use of rsync in this application for two reasons.
1. It adds no value; it's a more complex cp command (no bandwidth saved, etc., as archiving processes a single file at a time).
2. It lies on success/failure — Period.
I have used "cat" longer than I have used rsync to archive WALs. I can say that I've lost zero WAL files using cat; I cannot say the same for rsync.
The following code is more reliable than rsync and works across multiple platforms and filesystems without fail.
STS=3
OUTPUT=$(cat $XLOGFILE | $SSH_CMD "(mkdir -p $ARCH_DIR && cat > $ARCH_DIR/$WALFILE.swap) && mv $ARCH_DIR/$WALFILE.swap $ARCH_DIR/$WALFILE")
if [ $? -eq 0 ]; then
    STS=0
fi
exit $STS
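One way to harden this further, in the spirit of the sha1 comparison Dianne mentioned doing in production, is to have the remote side hash the received data and only rename it into place if the hash matches the sender's. This is a sketch only: it assumes sha256sum is available on both ends (substitute sha256 or sha1sum as appropriate) and reuses the same $SSH_CMD/$ARCH_DIR/$WALFILE/$XLOGFILE variables as the snippet above:

STS=3
SUM=$(sha256sum < "$XLOGFILE" | cut -d' ' -f1)

# stream the segment, re-hash it remotely, and only publish it if the hashes agree
OUTPUT=$(cat "$XLOGFILE" | $SSH_CMD "(mkdir -p $ARCH_DIR && cat > $ARCH_DIR/$WALFILE.swap) \
    && [ \"\$(sha256sum < $ARCH_DIR/$WALFILE.swap | cut -d' ' -f1)\" = \"$SUM\" ] \
    && mv $ARCH_DIR/$WALFILE.swap $ARCH_DIR/$WALFILE")
if [ $? -eq 0 ]; then
    STS=0
fi

exit $STS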
On 02/03/18 03:21, Rui DeSousa wrote:
>> On Mar 1, 2018, at 12:21 AM, scott ribe <scott_ribe@elevated-dev.com> wrote:
>>
>> The false report of success is not good, but it's not the root problem.
>
> A false success is a problem; especially in this use case as the
> source WAL file will be deleted by Postgres before it was truly
> successful. While monitoring is nice to avoid the issue it is not a
> fix for the issue.
>
> I personally cannot recommend the use of rsync in this application for
> two reasons.
>
> 1. It adds no value; it's a more complex cp command (no bandwidth
> saved, etc., as archiving processes a single file at a time).
> 2. It lies on success/failure — Period.
>
> I have used "cat" longer than I have used rsync to archive WALs. I can
> say that I've lost zero WAL files using cat; I cannot say the same
> for rsync.
>
> The following code is more reliable than rsync and works across
> multiple platforms and filesystems without fail.
>
> STS=3
>
> OUTPUT=$(cat $XLOGFILE | $SSH_CMD "(mkdir -p $ARCH_DIR && cat > $ARCH_DIR/$WALFILE.swap) && mv $ARCH_DIR/$WALFILE.swap $ARCH_DIR/$WALFILE")
> if [ $? == 0 ]; then
> STS=0
> fi
>
> exit $STS

If you have a self contained case that demonstrates rsync returning 0 when it has actually failed, then please do get the rsync authors involved in investigating it (I'm sure they would be interested).

Now, I've been unable to reproduce any cases of bad return codes or zero length files (using rsync based archive command + quotas), however I'm probably not using the same setup as you (and probably a different platform as well).

regards

Mark
I opened a ticket this morning:

https://bugzilla.samba.org/show_bug.cgi?id=13317
Greetings,

* Rui DeSousa (rui.desousa@icloud.com) wrote:
>> On Mar 1, 2018, at 12:21 AM, scott ribe <scott_ribe@elevated-dev.com> wrote:
>> The false report of success is not good, but it's not the root problem.
>
> A false success is a problem; especially in this use case as the source WAL file will be deleted by Postgres before it was truly successful. While monitoring is nice to avoid the issue it is not a fix for the issue.
>
> I personally cannot recommend the use of rsync in this application for two reasons.

There's a bigger problem that I'm amazed hasn't been brought up already - the rsync, copy, cp, whatever, may finish just fine and then the system that the WAL file was copied to crashes and you end up losing a bunch of WAL, because none of these methods actually syncs the WAL to disk.

No archive_command should ever return success until the WAL is actually written out to permanent storage and synced. That's what tools like pgBackRest do, and is part of the reason why people really shouldn't try to hack up their own archive_command solution but should be using well tested existing solutions.

Thanks!

Stephen
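To make the durability point concrete, here is one way a home-grown script could at least attempt it. This is a sketch under assumptions (GNU dd with conv=fsync on the receiving side, and the same variables as the earlier snippets), not a substitute for a tested tool such as pgBackRest:

STS=3

# write to a temp name, force the data to disk, publish it, then run a blunt
# "sync" so the rename itself also reaches permanent storage
OUTPUT=$(cat "$XLOGFILE" | $SSH_CMD "mkdir -p $ARCH_DIR \
    && dd of=$ARCH_DIR/$WALFILE.swap conv=fsync 2>/dev/null \
    && mv $ARCH_DIR/$WALFILE.swap $ARCH_DIR/$WALFILE \
    && sync")
if [ $? -eq 0 ]; then
    STS=0
fi

exit $STS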
Not a problem; as you should be archiving your WALs to multiple sites simultaneously. Your DR site should not rely on any resources stored at the primary site.

> On Mar 2, 2018, at 1:09 PM, Stephen Frost <sfrost@snowman.net> wrote:
>
> No archive_command should ever return success until the WAL is actually
> written out to permanent storage and sync'd.
> [...]
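As an illustrative aside (not something proposed in the thread), multi-site shipping in an archive_command wrapper could look roughly like the sketch below; the host names, paths, and script name are made up, and a real deployment would be better served by a tested tool.

#!/bin/sh
# Hypothetical helper: ship one WAL segment to several archive hosts and
# report success only if every destination received and renamed it.
# Assumed usage: archive_command = '/usr/local/bin/ship_wal.sh %p %f'
WAL_PATH="$1"
WAL_NAME="$2"
ARCH_DIR="/wal_archive"                                # assumed remote directory
HOSTS="archive1.example.com archive2.example.com"      # assumed DR sites

for host in $HOSTS; do
    # Like the cat-over-ssh example earlier in the thread, this trusts ssh's
    # exit status and does not fsync on the remote side; the fwrite helper
    # discussed later addresses that gap.
    cat "$WAL_PATH" | ssh -o ServerAliveInterval=20 "$host" "mkdir -p $ARCH_DIR && cat > $ARCH_DIR/$WAL_NAME.swap && mv $ARCH_DIR/$WAL_NAME.swap $ARCH_DIR/$WAL_NAME" || exit 1
done
exit 0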
>> https://bugzilla.samba.org/show_bug.cgi?id=13317

Wow, I would never have thought that such a simple question would end up in an rsync bug report.
Greetings,

* Rui DeSousa (rui.desousa@icloud.com) wrote:
> Not a problem; as you should be archiving your WALs to multiple sites simultaneously. Your DR site should not rely on any resources stored at the primary site.

Please don't top-post on the PG mailing lists. We prefer in-line responses, such as this.

I agree that archiving WAL to multiple sites is a good idea, but none of the proposed examples does that or even provides a way to easily and reliably do it, so that doesn't actually address the issue.

Thanks!

Stephen
In that case I suggest reading up on replication slots and pg_receivexlog. Everyone's topology and requirements are different and there are many ways to skin a cat.

> On Mar 4, 2018, at 10:06 AM, Stephen Frost <sfrost@snowman.net> wrote:
>
> I agree that archiving WAL to multiple sites is a good idea, but none of
> the proposed examples does that or even provides a way to easily and
> reliably do it, so that doesn't actually address the issue.
> [...]
Greetings,

* Rui DeSousa (rui.desousa@icloud.com) wrote:
> In that case I suggest reading up on replication slots and pg_receivexlog. Everyone's topology and requirements are different and there are many ways to skin a cat.

Please don't top-post.

Using replication slots and pg_receivewal can be a good solution if you're on a version which supports the combination of both. That doesn't mean the proposed archive commands are any good though, just to be clear. An issue with pg_receivewal though is that it's single-threaded.

Thanks!

Stephen
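For reference, a minimal sketch of the slot plus pg_receivewal combination being discussed. The slot name, connection details, and archive path are invented; pg_receivewal is the PostgreSQL 10+ name of the tool, older releases ship it as pg_receivexlog.

# Create a physical replication slot on the primary so WAL is retained
# until the receiver has consumed it (assumed slot name: wal_archive_slot).
psql -h primary.example.com -U replicator -d postgres \
     -c "SELECT pg_create_physical_replication_slot('wal_archive_slot');"

# Stream WAL segments into a local archive directory; --synchronous makes
# pg_receivewal flush each write to disk as soon as it is received, so the
# segments in the archive directory are durable.
pg_receivewal -h primary.example.com -U replicator \
              -S wal_archive_slot --synchronous \
              -D /var/lib/pgsql/wal_archive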
> On Mar 5, 2018, at 6:15 AM, Stephen Frost <sfrost@snowman.net> wrote:
>
> Using replication slots and pg_receivewal can be a good solution if
> you're on a version which supports the combination of both. That
> doesn't mean the proposed archive commands are any good though, just to
> be clear. An issue with pg_receivewal though is that it's
> single-threaded.
I must say this is a bit absurd; I didn’t realize that telling someone not to delete Postgres WAL files from underneath Postgres would require me to provide a complete redundant backup solution.
pg_receivexlog or the archive_command being single-threaded is not an issue; at least not for me, and I'm generating 1/2TB of WALs a day. The real problem is that applying the WALs is single-threaded: trying to apply a single day's worth of WALs takes too long, and that is one of the reasons I take multiple backups a day to reduce the number of WALs required during a PITR.
The solution you proposed would not be able to keep up with the rate of backups I issue daily, nor is it capable of taking backups on the replica, at least not yet, judging from the presentation I reviewed. I do my backups on replicas at multiple sites, with WAL files also being stored at multiple sites. I also do daily restores in a lower environment, which take less than 5 minutes to do; obviously I'm making extensive use of snapshots and snapshot replication.
If your archive server is crashing then you have other issues, and one should work to remove single points of failure. I'm not sure what filesystem you're using, but the one I use syncs to disk every 30 seconds.
The original poster stated that rsync wasn't even an option and that he is not even using it. I'm not here trying to push a backup solution or anything else; I was just trying to give some simple advice on the given problem.
Greetings,

* Rui DeSousa (rui.desousa@icloud.com) wrote:
>> On Mar 5, 2018, at 6:15 AM, Stephen Frost <sfrost@snowman.net> wrote:
>> Using replication slots and pg_receivewal can be a good solution if
>> you're on a version which supports the combination of both. That
>> doesn't mean the proposed archive commands are any good though, just to
>> be clear. An issue with pg_receivewal though is that it's
>> single-threaded.
>
> I must say this is a bit absurd; I didn't realize that telling someone not to delete Postgres WAL files from underneath Postgres would require me to provide a complete redundant backup solution.

It doesn't, but please don't encourage partial solutions which have very clear issues.

> pg_receivexlog or the archive_command being single-threaded is not an issue; at least not for me, and I'm generating 1/2TB of WALs a day. The real problem is that applying the WALs is single-threaded [...]

I agree that the single-threaded nature of pg_basebackup and pg_receivewal isn't an issue for all environments, and that it would be good to have a way to replay WAL faster, and that performing frequent incrementals is a way to reduce the amount of WAL to be replayed.

> The solution you proposed would not be able to keep up with the rate of backups I issue daily, nor is it capable of taking backups on the replica [...]

The only solution which I mentioned on this thread was pgBackRest, which keeps up very well with very frequent backups, including those which utilize a replica, so I'm not sure what you're referring to here. pgBackRest is also able to perform fast restores when there have been few changes. Note that you had suggested pg_receivewal; I was just commenting on it.

Snapshots and snapshot replication can be useful, but it's vital to ensure that the entire snapshot is taken atomically across all tablespaces (unless you're doing both snapshots *and* using pg_start/stop_backup, collecting necessary WAL, and being sure to put the backup_label file into place during restore).

> If your archive server is crashing then you have other issues, and one should work to remove single points of failure. [...]

This thread began with a concern about an unreliable network; I don't think it's unreasonable to consider other failure scenarios. PostgreSQL performs a great deal of work to ensure that data is written out in a manner which can be recovered in the event of a server crash; utilities which are also working with that data should be similarly resilient.

> The original poster stated that rsync wasn't even an option and that he is not even using it. [...]

I agree that rsync is a bad approach. Similarly, proposing archive commands which don't do the basics, such as ensuring the WAL file has been sync'd to disk before returning success, isn't good.

Thanks!

Stephen
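For context, a minimal sketch of the snapshot-based backup being described, using the exclusive backup API that existed at the time (it was removed in later PostgreSQL releases). The label and the ZFS snapshot command are placeholders for whatever the filesystem provides.

# Tell PostgreSQL a base backup is starting; the second argument requests
# an immediate checkpoint so the call returns quickly.
psql -c "SELECT pg_start_backup('nightly_snapshot', true);"

# Take the filesystem snapshot; it must cover the data directory and every
# tablespace atomically (the command below assumes a ZFS dataset named
# tank/pgdata and is only an example).
zfs snapshot tank/pgdata@nightly

# End the backup; this ensures the WAL needed to make the snapshot
# consistent gets archived.
psql -c "SELECT pg_stop_backup();"

With the exclusive API the backup_label file is written into the data directory and so is captured by the snapshot; with the non-exclusive API (9.6+) pg_stop_backup returns the label contents, which must be saved separately and put into place during restore, as Stephen notes above.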
> On Mar 5, 2018, at 10:02 AM, Stephen Frost <sfrost@snowman.net> wrote:
>
> It doesn't, but please don't encourage partial solutions which have very clear issues.
The problem is there are no good base utilities that are useful with archive_command, unless you're writing directly to an NFS mount or a tape library. Even Barman recommends rsync with the archive_command if you are unable to use the pg_receivexlog solution. There is plenty of Postgres documentation out there that recommends the use of rsync with the archive_command.
Here is a solution that will fsync() the file on the other end.
# SSH command with a keepalive so a dead network is detected instead of hanging.
SSH_CMD="ssh -o ServerAliveInterval=20 $ARCH_SERVER"

STS=3

# Stream the segment to the remote fwrite helper, which writes to a temp
# file, fsync()s it, and renames it into place (see the C code below).
OUTPUT=$(cat $XLOGFILE | $SSH_CMD "(mkdir -p $ARCH_DIR && ~/bin/fwrite $ARCH_DIR/$WALFILE)")
if [ $? -eq 0 ]; then
    STS=0
fi

exit $STS
fwrite code:
#include <sys/stat.h>
#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>

#define BUFSIZE 131072

/*
 * fwrite: read a WAL segment from stdin, write it to <file>.XXXXXX,
 * fsync() it, and only then rename() it to the final name, so the
 * archive never exposes a partially written segment.
 */
int
main(int argc, char *argv[])
{
	int fd;
	ssize_t r, w, off;
	char *buf;
	char *name;

	if (argc != 2) {
		fprintf(stderr, "usage: fwrite [file]\n");
		exit(1);
	}

	if ((buf = malloc(BUFSIZE)) == NULL)
		err(1, "malloc");

	++argv;

	/* Build the temporary file name "<target>.XXXXXX" for mkstemp(). */
	if ((name = (char *) malloc(strlen(*argv) + 10)) == NULL)
		err(1, "malloc");
	strcat(strcpy(name, *argv), ".XXXXXX");

	if ((fd = mkstemp(name)) < 0)
		err(1, "mkstemp");

	/* Copy stdin to the temp file, retrying short writes. */
	while ((r = read(STDIN_FILENO, buf, BUFSIZE)) > 0)
		for (off = 0; off < r; off += w)
			if ((w = write(fd, buf + off, r - off)) == -1)
				err(1, "write");
	if (r < 0)
		err(1, "read");

	/* Force the data to stable storage before exposing the final name. */
	if (fsync(fd) != 0)
		err(1, "fsync");
	if (close(fd) != 0)
		err(1, "close");
	if (rename(name, *argv) != 0)
		err(1, "rename");

	free(name);
	exit(0);
}
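For completeness, one possible way to build and stage the helper on the archive host; the compiler invocation, paths, and test file name are assumptions.

# Build the helper and place it where the archive_command above expects it.
cc -O2 -o fwrite fwrite.c
mkdir -p ~/bin
mv fwrite ~/bin/fwrite

# Quick smoke test: the file only appears under its final name once the
# data has been written, fsync()ed, and renamed.
echo "test payload" | ~/bin/fwrite /tmp/fwrite_test && cat /tmp/fwrite_test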
On 3/6/18 12:55 PM, Rui DeSousa wrote:
>> On Mar 5, 2018, at 10:02 AM, Stephen Frost <sfrost@snowman.net> wrote:
>>
>> It doesn't, but please don't encourage partial solutions which have
>> very clear issues.
>
> The problem is there are no good base utilities that are useful with
> archive_command, unless you're writing directly to an NFS mount or a
> tape library.

It's true that there are no good base utilities for this purpose, not sure why NFS would be an exception, though.

> Even Barman recommends rsync with the archive_command if you are unable
> to use the pg_receivexlog solution. There is plenty of Postgres
> documentation out there that recommends the use of rsync with the
> archive_command.

Also true, even though rsync has shown itself to be a poor tool for database backups.

> Here is a solution that will fsync() the file on the other end.

At a quick glance, that looks like it would work. But why go through the trouble when pgBackRest does all of that (and much more), and does not rely on rsync? You are obviously knowledgeable and like to do it yourself, but not everybody wants to develop a solution from scratch, test it, and maintain it.

Regards,
--
-David
david@pgmasters.net
> On Mar 6, 2018, at 1:47 PM, David Steele <david@pgmasters.net> wrote:
>
> It's true that there are no good base utilities for this purpose, not
> sure why NFS would be an exception, though.
NFS file writes are synchronous writes on the NFS server, unless you make them asynchronous, which makes for a bad NFS server.
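As an illustration of the distinction, a sketch using Linux-style exports (the export path and client network are made up, and the syntax differs on other NFS servers; the client's own mount options also matter):

# Append an export with the 'sync' option so the NFS server commits writes
# to stable storage before acknowledging them; 'async' would acknowledge
# early and could lose archived WAL if the server crashes.
echo '/wal_archive 10.0.0.0/24(rw,sync,no_subtree_check)' >> /etc/exports

# Re-export and verify the active options.
exportfs -ra
exportfs -v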
On 3/6/18 1:53 PM, Rui DeSousa wrote:
>> On Mar 6, 2018, at 1:47 PM, David Steele <david@pgmasters.net> wrote:
>>
>> It's true that there are no good base utilities for this purpose, not
>> sure why NFS would be an exception, though.
>
> NFS file writes are synchronous writes on the NFS server, unless you
> make them asynchronous, which makes for a bad NFS server.

In my experience this is common, because it gives better performance and most people don't understand the risks.

--
-David
david@pgmasters.net
Greetings,

* Rui DeSousa (rui.desousa@icloud.com) wrote:
>> On Mar 5, 2018, at 10:02 AM, Stephen Frost <sfrost@snowman.net> wrote:
>> It doesn't, but please don't encourage partial solutions which have very clear issues.
>
> The problem is there are no good base utilities that are useful with archive_command, unless you're writing directly to an NFS mount or a tape library. Even Barman recommends rsync with the archive_command if you are unable to use the pg_receivexlog solution. There is plenty of Postgres documentation out there that recommends the use of rsync with the archive_command.
>
> Here is a solution that will fsync() the file on the other end.

I'd encourage you to consider submitting this to a commitfest to get feedback on it, as a proper patch to add such a utility to PG and which we could use in our documentation instead of saying "cp", which is pretty terrible.

Then again, as David mentioned, there are already utilities which do this properly, even if they're not included in core today.

Thanks!

Stephen
On 3/6/18 2:00 PM, Stephen Frost wrote:
> * Rui DeSousa (rui.desousa@icloud.com) wrote:
>>> On Mar 5, 2018, at 10:02 AM, Stephen Frost <sfrost@snowman.net> wrote:
>>> It doesn't, but please don't encourage partial solutions which have very clear issues.
>>
>> [...]
>>
>> Here is a solution that will fsync() the file on the other end.
>
> I'd encourage you to consider submitting this to a commitfest to get
> feedback on it, as a proper patch to add such a utility to PG and which
> we could use in our documentation instead of saying "cp", which is
> pretty terrible.

+1. I believe there's been talk of a pgcopy utility before but it never went anywhere. I think it would be worth it just to have a decent example in the docs.

--
-David
david@pgmasters.net
On Tue, Mar 06, 2018 at 02:31:55PM -0500, David Steele wrote:
> On 3/6/18 2:00 PM, Stephen Frost wrote:
>> [...]
>>
>> I'd encourage you to consider submitting this to a commitfest to get
>> feedback on it, as a proper patch to add such a utility to PG and which
>> we could use in our documentation instead of saying "cp", which is
>> pretty terrible.
>
> +1. I believe there's been talk of a pgcopy utility before but it never
> went anywhere. I think it would be worth it just to have a decent
> example in the docs.

I wrote a (rather trivial, but it has proven useful) utility for a minimalistic "archive_command" a while ago - it is available at https://johannes.truschnigg.info/code/pg_archive_wal_segment-2.0.0/ - because I was unhappy with the suboptimal approach that the shell one-liner cited in the docs took.

I'm not sure if it will survive a thousand eyeballs' close look, but it has served us well for more than three years now with anywhere between 10000 and 80000 WAL segments archived each day (to a destination filesystem that is then exported as an NFS share).

--
with best regards:
- Johannes Truschnigg ( johannes@truschnigg.info )

www: https://johannes.truschnigg.info/
phone: +43 650 2 133337
xmpp: johannes@truschnigg.info

Please do not bother me with HTML-email or attachments. Thank you.
On 07/03/18 08:31, David Steele wrote:
> On 3/6/18 2:00 PM, Stephen Frost wrote:
>> * Rui DeSousa (rui.desousa@icloud.com) wrote:
>>>> On Mar 5, 2018, at 10:02 AM, Stephen Frost <sfrost@snowman.net> wrote:
>>>> It doesn't, but please don't encourage partial solutions which have very clear issues.
>>> [...]
>>> Here is a solution that will fsync() the file on the other end.
>> I'd encourage you to consider submitting this to a commitfest to get
>> feedback on it, as a proper patch to add such a utility to PG and which
>> we could use in our documentation instead of saying "cp", which is
>> pretty terrible.
> +1. I believe there's been talk of a pgcopy utility before but it never
> went anywhere. I think it would be worth it just to have a decent
> example in the docs.

+1. I think if there is a strong feeling that the current docs are suggesting things that shouldn't actually be done, then we should provide an alternative in the distribution itself.

Re pgBackRest (and other backup tools that various Postgres companies have written): while these may be great (pgBackRest looks pretty good, for instance), they don't come with the Postgres package itself, and our docs suggest something else. Hence discussion threads like this one!

The added (new) issue that there may be some unusual bug with rsync + quotas and ZFS on BSD (at least) gives a bit more weight to this idea.

regards

Mark
On 07/03/18 16:18, Mark Kirkwood wrote:
> The added (new) issue that there may be some unusual bug with rsync +
> quotas and ZFS on BSD (at least) gives a bit more weight to this idea.

Looking at the progress of said bug (https://bugzilla.samba.org/show_bug.cgi?id=13317), it appears to be:

- due to the interaction of ZFS and quotas
- specific to FreeBSD
- addressed by a patch that has now been developed

However, code review of rsync shows it needs a patch too (it does not always check the success of writes). I'll keep an eye out for progress there.

regards

Mark