Обсуждение: thinko in basic_archive.c
Hi hackers, I intended for the temporary file name generated by basic_archive.c to include the current timestamp so that the name was "sufficiently unique." Of course, this could also be used to determine the creation time, but you can just as easily use stat(1) for that. In any case, I forgot to divide the microseconds field by 1000 to obtain the current timestamp in milliseconds, so while the value is unique, it's also basically garbage. I've attached a small patch that fixes this so that the temporary file name includes the timestamp in milliseconds for when it was created. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Вложения
On Fri, Oct 14, 2022 at 10:11 AM Nathan Bossart
<nathandbossart@gmail.com> wrote:
>
> I intended for the temporary file name generated by basic_archive.c to
I'm trying to understand this a bit:
    /*
     * Pick a sufficiently unique name for the temporary file so that a
     * collision is unlikely.  This helps avoid problems in case a temporary
     * file was left around after a crash or another server happens to be
     * archiving to the same directory.
     */
Given that temp file name includes WAL file name, epoch to
milliseconds scale and MyProcPid, can there be name collisions after a
server crash or even when multiple servers with different pids are
archiving/copying the same WAL file to the same directory?
What happens to the left-over temp files after a server crash? Will
they be lying around in the archive directory? I understand that we
can't remove such files because we can't distinguish left-over files
from a crash and the temp files that another server is in the process
of copying.
If the goal is to copy files atomically, why can't we name the temp
file 'wal_file_name.pid.temp', assuming no PID wraparound and get rid
of appending time? Since basic_archive is a test module illustrating
archive_library implementation, do we really need to worry about name
collisions?
> I've attached a small patch that fixes this so that the temporary file name
> includes the timestamp in milliseconds for when it was created.
The patch LGTM.
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
			
		On Fri, Oct 14, 2022 at 02:15:19PM +0530, Bharath Rupireddy wrote: > Given that temp file name includes WAL file name, epoch to > milliseconds scale and MyProcPid, can there be name collisions after a > server crash or even when multiple servers with different pids are > archiving/copying the same WAL file to the same directory? While unlikely, I think it's theoretically possible. If there is a collision, archiving should fail and retry later with a different temporary file name. > What happens to the left-over temp files after a server crash? Will > they be lying around in the archive directory? I understand that we > can't remove such files because we can't distinguish left-over files > from a crash and the temp files that another server is in the process > of copying. The temporary files are not automatically removed after a crash. The documentation for basic archive has a note about this [0]. > If the goal is to copy files atomically, why can't we name the temp > file 'wal_file_name.pid.temp', assuming no PID wraparound and get rid > of appending time? Since basic_archive is a test module illustrating > archive_library implementation, do we really need to worry about name > collisions? Yeah, it's debatable how much we care about this for basic_archive. We previously decided that we at least care a little [1], so that's why we have such elaborate temporary file names. If anything, I hope that the presence of this logic causes archive module authors to think about these problems. > The patch LGTM. Thanks! [0] https://www.postgresql.org/docs/devel/basic-archive.html#id-1.11.7.15.6 [1] https://postgr.es/m/CA%2BTgmoaSkSmo22SwJaV%2BycNPoGpxe0JV%3DTcTbh4ip8Cwjr0ULQ%40mail.gmail.com -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Sat, Oct 15, 2022 at 12:03 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > > On Fri, Oct 14, 2022 at 02:15:19PM +0530, Bharath Rupireddy wrote: > > Given that temp file name includes WAL file name, epoch to > > milliseconds scale and MyProcPid, can there be name collisions after a > > server crash or even when multiple servers with different pids are > > archiving/copying the same WAL file to the same directory? > > While unlikely, I think it's theoretically possible. Can you please help me understand how name collisions can happen with temp file names including WAL file name, timestamp to millisecond scale, and PID? Having the timestamp is enough to provide a non-unique temp file name when PID wraparound occurs, right? Am I missing something here? > > What happens to the left-over temp files after a server crash? Will > > they be lying around in the archive directory? I understand that we > > can't remove such files because we can't distinguish left-over files > > from a crash and the temp files that another server is in the process > > of copying. > > The temporary files are not automatically removed after a crash. The > documentation for basic archive has a note about this [0]. Hm, we cannot remove the temp file for all sorts of crashes, but having on_shmem_exit() or before_shmem_exit() or atexit() or any such callback removing it would help us cover some crash scenarios (that exit with proc_exit() or exit()) at least. I think the basic_archive module currently leaves temp files around even when the server is restarted legitimately while copying to or renaming the temp file, no? I can quickly find these exit callbacks deleting the files: atexit(cleanup_directories_atexit); atexit(remove_temp); before_shmem_exit(ReplicationSlotShmemExit, 0); before_shmem_exit(logicalrep_worker_onexit, (Datum) 0); before_shmem_exit(BeforeShmemExit_Files, 0); -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Sat, Oct 15, 2022 at 10:19:05AM +0530, Bharath Rupireddy wrote: > Can you please help me understand how name collisions can happen with > temp file names including WAL file name, timestamp to millisecond > scale, and PID? Having the timestamp is enough to provide a non-unique > temp file name when PID wraparound occurs, right? Am I missing > something here? Outside of contrived cases involving multiple servers, inaccurate clocks, PID reuse, etc., it seems unlikely. > Hm, we cannot remove the temp file for all sorts of crashes, but > having on_shmem_exit() or before_shmem_exit() or atexit() or any such > callback removing it would help us cover some crash scenarios (that > exit with proc_exit() or exit()) at least. I think the basic_archive > module currently leaves temp files around even when the server is > restarted legitimately while copying to or renaming the temp file, no? I think the right way to do this would be to add handling for leftover files in the sigsetjmp() block and a shutdown callback (which just sets up a before_shmem_exit callback). While this should ensure those files are cleaned up after an ERROR or FATAL, crashes and unlink() failures could still leave files behind. We'd probably also need to avoid cleaning up the temp file if copy_file() fails because it already exists, as we won't know if it's actually ours. Overall, I suspect it'd be more trouble than it's worth. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Sat, Oct 15, 2022 at 02:10:26PM -0700, Nathan Bossart wrote: > On Sat, Oct 15, 2022 at 10:19:05AM +0530, Bharath Rupireddy wrote: >> Can you please help me understand how name collisions can happen with >> temp file names including WAL file name, timestamp to millisecond >> scale, and PID? Having the timestamp is enough to provide a non-unique >> temp file name when PID wraparound occurs, right? Am I missing >> something here? > > Outside of contrived cases involving multiple servers, inaccurate clocks, > PID reuse, etc., it seems unlikely. With a name based on a PID in a world where pid_max can be larger than the default and a timestamp, I would say even more unlikely than what you are implying with unlikely ;p > I think the right way to do this would be to add handling for leftover > files in the sigsetjmp() block and a shutdown callback (which just sets up > a before_shmem_exit callback). While this should ensure those files are > cleaned up after an ERROR or FATAL, crashes and unlink() failures could > still leave files behind. We'd probably also need to avoid cleaning up the > temp file if copy_file() fails because it already exists, as we won't know > if it's actually ours. Overall, I suspect it'd be more trouble than it's > worth. Agreed. My opinion is that we should keep basic_archive as minimalistic as we can: short still useful. It does not have to be perfect, just to fit with what we want it to show, as a reference. Anyway, the maths were wrong, so I have applied the patch of upthread, with an extra pair of parenthesis, a comment where epoch is declared to tell that it is in milliseconds, and a comment in basic_archive's Makefile to mention the reason why we have NO_INSTALLCHECK. -- Michael
Вложения
On Fri, Oct 14, 2022 at 4:45 AM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > What happens to the left-over temp files after a server crash? Will > they be lying around in the archive directory? I understand that we > can't remove such files because we can't distinguish left-over files > from a crash and the temp files that another server is in the process > of copying. Yeah, leaving a potentially unbounded number of files around after system crashes seems pretty unfriendly. I'm not sure how to fix that, exactly. We could use a name based on the database system identifier if we thought that we might be archiving from multiple unrelated clusters to the same directory, but presumably the real hazard is a bunch of machines that are doing physical replication among themselves, and will therefore share a system identifier. There might be no better answer than to suggest that temporary files that are "old" should be removed by means external to the database, but that's not an entirely satisfying answer. -- Robert Haas EDB: http://www.enterprisedb.com
On Mon, Oct 17, 2022 at 6:45 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Fri, Oct 14, 2022 at 4:45 AM Bharath Rupireddy > <bharath.rupireddyforpostgres@gmail.com> wrote: > > What happens to the left-over temp files after a server crash? Will > > they be lying around in the archive directory? I understand that we > > can't remove such files because we can't distinguish left-over files > > from a crash and the temp files that another server is in the process > > of copying. > > Yeah, leaving a potentially unbounded number of files around after > system crashes seems pretty unfriendly. I'm not sure how to fix that, > exactly. A simple server restart while the basic_archive module is copying to/from temp file would leave the file behind, see[1]. I think we can fix this by defining shutdown_cb for the basic_archive module, like the attached patch. While this helps in most of the crashes, but not all. However, this is better than what we have right now. [1] ubuntu:~/postgres/contrib/basic_archive/archive_directory$ ls 000000010000000000000001 archtemp.000000010000000000000002.2493876.1666091933457 archtemp.000000010000000000000002.2495316.1666091958680 -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Вложения
+0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in > On Mon, Oct 17, 2022 at 6:45 PM Robert Haas <robertmhaas@gmail.com> wrote: > > > > On Fri, Oct 14, 2022 at 4:45 AM Bharath Rupireddy > > <bharath.rupireddyforpostgres@gmail.com> wrote: > > > What happens to the left-over temp files after a server crash? Will > > > they be lying around in the archive directory? I understand that we > > > can't remove such files because we can't distinguish left-over files > > > from a crash and the temp files that another server is in the process > > > of copying. > > > > Yeah, leaving a potentially unbounded number of files around after > > system crashes seems pretty unfriendly. I'm not sure how to fix that, > > exactly. Unbounded number of sequential crash-restarts itself is a more serious problem.. An archive module could clean up them at startup or at the first call but that might be dangerous (since archive directory is I think thought as external resource). Honestly, I'm not sure about a reasonable scenario where simultaneous archivings of a same file is acceptable, though. I feel that we should not allow simultaneous archiving of the same segment by some kind of interlocking. In other words, we might should serialize duplicate archiving of asame file. In another direction, the current code allows duplicate simultaneous copying to temporary files with different names then the latest renaming wins. We reach the almost same result (on Linuxen (or POSIX?)) by unlinking the existing tempfile first then create a new one with the same name then continue. Even if the tempfile were left alone after a crash, that file would be unlinked at the next trial of archiving. But I'm not sure how this can be done on Windows.. In the first place I'm not sure that the latest-unconditionally-wins policy is appropriate or not, though. > A simple server restart while the basic_archive module is copying > to/from temp file would leave the file behind, see[1]. I think we can > fix this by defining shutdown_cb for the basic_archive module, like > the attached patch. While this helps in most of the crashes, but not > all. However, this is better than what we have right now. ShutdownWalRecovery() does the similar thing, but as you say this one covers rather narrow cases than that since RestoreArchiveFile() finally overwrites the left-alone files at the next call for that file. # The patch seems forgetting to clear the tmepfilepath *after* a # successful renaming. And I don't see how the callback is called. > [1] ubuntu:~/postgres/contrib/basic_archive/archive_directory$ ls > 000000010000000000000001 > archtemp.000000010000000000000002.2493876.1666091933457 > archtemp.000000010000000000000002.2495316.1666091958680 regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Wed, Oct 19, 2022 at 8:58 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > Unbounded number of sequential crash-restarts itself is a more serious > problem.. Agree. The basic_archive module currently leaves temp files around even for normal restarts of the cluster, which is bad IMO. > An archive module could clean up them at startup or at the first call > but that might be dangerous (since archive directory is I think > thought as external resource). The archive module must be responsible for cleaning up the temp file that it creates. One way to do it is in the archive module's shutdown callback, this covers most of the cases, but not all. > Honestly, I'm not sure about a reasonable scenario where simultaneous > archivings of a same file is acceptable, though. I feel that we should > not allow simultaneous archiving of the same segment by some kind of > interlocking. In other words, we might should serialize duplicate > archiving of asame file. In a typical production environment, there's some kind of locking (for instance lease files) that allow/disallow file creation/deletion/writes/reads which guarantees that the same file isn't put into a directory (can be archive location) many times. And as you rightly said archive_directory is something external to postgres and we really can't deal with concurrent writers writing/creating the same files. Even if we somehow try to do it, it makes things complicated. This is true for any PGDATA directories. However, the archive module implementers can choose to define such a locking strategy. > In another direction, the current code allows duplicate simultaneous > copying to temporary files with different names then the latest > renaming wins. We reach the almost same result (on Linuxen (or > POSIX?)) by unlinking the existing tempfile first then create a new > one with the same name then continue. Even if the tempfile were left > alone after a crash, that file would be unlinked at the next trial of > archiving. But I'm not sure how this can be done on Windows.. In the > first place I'm not sure that the latest-unconditionally-wins policy > is appropriate or not, though. We really can't just unlink the temp file because it has pid and timestamp in the filename and it's hard to determine the temp file that we created earlier. As far as the basic_archive module is concerned, we ought to keep it simple. I still think the simplest we can do is to use the basic_archive's shutdown_cb to delete (in most of the cases, but not all) the left-over temp file that the module is dealing with as-of-the-moment and add a note about the users dealing with concurrent writers to the basic_archive.archive_directory like the attached v2 patch. > > A simple server restart while the basic_archive module is copying > > to/from temp file would leave the file behind, see[1]. I think we can > > fix this by defining shutdown_cb for the basic_archive module, like > > the attached patch. While this helps in most of the crashes, but not > > all. However, this is better than what we have right now. > > ShutdownWalRecovery() does the similar thing, but as you say this one > covers rather narrow cases than that since RestoreArchiveFile() > finally overwrites the left-alone files at the next call for that > file. We're using unique temp file names in the basic_archive module so we can't overwrite the same upon restart. > # The patch seems forgetting to clear the tmepfilepath *after* a > # successful renaming. It does so at the beginning of basic_archive_file() which is sufficient. > And I don't see how the callback is called. call_archive_module_shutdown_callback()->basic_archive_shutdown(). Please see the attached v2 patch. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Вложения
On Tue, Oct 18, 2022 at 11:28 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > > Yeah, leaving a potentially unbounded number of files around after > > > system crashes seems pretty unfriendly. I'm not sure how to fix that, > > > exactly. > > Unbounded number of sequential crash-restarts itself is a more serious > problem.. They don't have to be sequential. Garbage could accumulate over weeks, months, or years. Anyway, I agree we should hope that the system doesn't crash often, but we cannot prevent the system administrator from removing the power whenever they like. We can however try to reduce the number of database-related things that go wrong if this happens, and I think we should. Bharath's patch seems like it's probably a good idea, and if we can do better, we should. -- Robert Haas EDB: http://www.enterprisedb.com
At Wed, 19 Oct 2022 10:48:03 -0400, Robert Haas <robertmhaas@gmail.com> wrote in > On Tue, Oct 18, 2022 at 11:28 PM Kyotaro Horiguchi > <horikyota.ntt@gmail.com> wrote: > > > > Yeah, leaving a potentially unbounded number of files around after > > > > system crashes seems pretty unfriendly. I'm not sure how to fix that, > > > > exactly. > > > > Unbounded number of sequential crash-restarts itself is a more serious > > problem.. (Sorry, this was just a kidding.) > They don't have to be sequential. Garbage could accumulate over weeks, > months, or years. Sure. Users' archive cleanup facilities don't work if they only handles the files that with legit WAL file names. > Anyway, I agree we should hope that the system doesn't crash often, > but we cannot prevent the system administrator from removing the power > whenever they like. We can however try to reduce the number of > database-related things that go wrong if this happens, and I think we > should. Bharath's patch seems like it's probably a good idea, and if > we can do better, we should. Yeah, I don't deny this, rather agree. So, we should name temporary files so that they are identifiable as garbage unconditionally at the next startup. (They can be being actually active otherwise.) regards. -- Kyotaro Horiguchi NTT Open Source Software Center
At Wed, 19 Oct 2022 10:21:12 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in > On Wed, Oct 19, 2022 at 8:58 AM Kyotaro Horiguchi > <horikyota.ntt@gmail.com> wrote: > > An archive module could clean up them at startup or at the first call > > but that might be dangerous (since archive directory is I think > > thought as external resource). > > The archive module must be responsible for cleaning up the temp file > that it creates. One way to do it is in the archive module's shutdown > callback, this covers most of the cases, but not all. True. But I agree to Robert that such temporary files should be cleanup-able without needing temporarily-valid knowledge (current file name, in this case). A common strategy for this is to name those files by names that can be identifed as garbage. > > Honestly, I'm not sure about a reasonable scenario where simultaneous > > archivings of a same file is acceptable, though. I feel that we should > > not allow simultaneous archiving of the same segment by some kind of > > interlocking. In other words, we might should serialize duplicate > > archiving of asame file. > > In a typical production environment, there's some kind of locking (for > instance lease files) that allow/disallow file > creation/deletion/writes/reads which guarantees that the same file > isn't put into a directory (can be archive location) many times. And > as you rightly said archive_directory is something external to > postgres and we really can't deal with concurrent writers > writing/creating the same files. Even if we somehow try to do it, it > makes things complicated. This is true for any PGDATA directories. > However, the archive module implementers can choose to define such a > locking strategy. flock() on nfs.. > > In another direction, the current code allows duplicate simultaneous > > copying to temporary files with different names then the latest > > renaming wins. We reach the almost same result (on Linuxen (or > > POSIX?)) by unlinking the existing tempfile first then create a new > > one with the same name then continue. Even if the tempfile were left > > alone after a crash, that file would be unlinked at the next trial of > > archiving. But I'm not sure how this can be done on Windows.. In the > > first place I'm not sure that the latest-unconditionally-wins policy > > is appropriate or not, though. > > We really can't just unlink the temp file because it has pid and > timestamp in the filename and it's hard to determine the temp file > that we created earlier. But since power cut is a typical crash source, we need to identify ruined temporary files and the current naming convention is incomplete in this regard. The worst case I can come up with regardless of feasibility is a multi-standby physical replication set where all hosts share one archive directory. Indeed live and dead temprary files can coexist there. However, I think we can identify truly rotten temp files by inserting host name or cluster name (means cluster_name in postgresql.conf) even in that case. This premise that DBA names every cluster differently, but I think DBAs that is going to configure such a system are required to be very cautious about that kind of aspect. > As far as the basic_archive module is concerned, we ought to keep it > simple. I still think the simplest we can do is to use the > basic_archive's shutdown_cb to delete (in most of the cases, but not (Sorry, my memory was confused at the time. That callback feature already existed.) > all) the left-over temp file that the module is dealing with > as-of-the-moment and add a note about the users dealing with > concurrent writers to the basic_archive.archive_directory like the > attached v2 patch. > > > > A simple server restart while the basic_archive module is copying > > > to/from temp file would leave the file behind, see[1]. I think we can > > > fix this by defining shutdown_cb for the basic_archive module, like > > > the attached patch. While this helps in most of the crashes, but not > > > all. However, this is better than what we have right now. > > > > ShutdownWalRecovery() does the similar thing, but as you say this one > > covers rather narrow cases than that since RestoreArchiveFile() > > finally overwrites the left-alone files at the next call for that > > file. > > We're using unique temp file names in the basic_archive module so we > can't overwrite the same upon restart. Of course, it premised that a cluster uses the same name for a segment. If we insert cluseter_name into the temprary name, a starting cluster can indentify garbage files to clean up. For example if we name them as follows. ARCHTEMP_cluster1_pid_time_<lsn> A starting cluster can clean up all files starts with "archtemp_cluster1_*". (We need to choose the delimiter carefully, though..) > > # The patch seems forgetting to clear the tmepfilepath *after* a > > # successful renaming. > > It does so at the beginning of basic_archive_file() which is sufficient. No. I didn't mean that, If server stops after a successfull durable_rename but before the next call to basic_archive_file_internal, that call back makes false comlaint since that temprary file is actually gone. > > And I don't see how the callback is called. > > call_archive_module_shutdown_callback()->basic_archive_shutdown(). Yeah, sorry for the noise. > Please see the attached v2 patch. +static char tempfilepath[MAXPGPATH + 256]; MAXPGPATH is the maximum length of a file name that PG assumes to be able to handle. Thus extending that length seems wrong. -- Kyotaro Horiguchi NTT Open Source Software Center
On Thu, Oct 20, 2022 at 6:57 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > > The archive module must be responsible for cleaning up the temp file > > that it creates. One way to do it is in the archive module's shutdown > > callback, this covers most of the cases, but not all. > > True. But I agree to Robert that such temporary files should be > cleanup-able without needing temporarily-valid knowledge (current file > name, in this case). A common strategy for this is to name those files > by names that can be identifed as garbage. I'm not sure how we can distinguish temp files as garbage based on name. As Robert pointed out upthread, using system identifier may not help as the standbys share the same system identifier and it's possible that they might archive to the same directory. Is there any other way? > But since power cut is a typical crash source, we need to identify > ruined temporary files and the current naming convention is incomplete > in this regard. Please note that basic_archive module creates one temp file at a time to make file copying/moving atomic and it can keep track of the temp file name and delete it using shutdown callback which helps in most of the scenarios. As said upthread, repeated crashes while basic_archive module is atomically copying files around is a big problem in itself and basic_archive module need not worry about it much. > flock() on nfs.. > > The worst case I can come up with regardless of feasibility is a > multi-standby physical replication set where all hosts share one > archive directory. Indeed live and dead temprary files can coexist > there. However, I think we can identify truly rotten temp files by > inserting host name or cluster name (means cluster_name in > postgresql.conf) even in that case. This premise that DBA names every > cluster differently, but I think DBAs that is going to configure such > a system are required to be very cautious about that kind of aspect. Well, these ideas are great! However, we can leave defining such strategies to archive module implementors. IMO, the basich_archive module ought to be as simple and elegant as possible yet showing up the usability of archive modules feature. > Of course, it premised that a cluster uses the same name for a > segment. If we insert cluseter_name into the temprary name, a starting > cluster can indentify garbage files to clean up. For example if we > name them as follows. > > ARCHTEMP_cluster1_pid_time_<lsn> > > A starting cluster can clean up all files starts with > "archtemp_cluster1_*". (We need to choose the delimiter carefully, > though..) Postgres cleaning up basic_archive modules temp files at the start up isn't a great idea IMO. Because these files are not related to server functionality in any way unlike temp files removed in RemovePgTempFiles(). IMO, we ought to keep the basic_archive module simple. . > No. I didn't mean that, If server stops after a successfull > durable_rename but before the next call to > basic_archive_file_internal, that call back makes false comlaint since > that temprary file is actually gone. Right. Fixed it. > > Please see the attached v2 patch. > > +static char tempfilepath[MAXPGPATH + 256]; > > MAXPGPATH is the maximum length of a file name that PG assumes to be > able to handle. Thus extending that length seems wrong. I think it was to accommodate the temp file name - "archtemp", file, MyProcPid, epoch, but I agree that it can just be MAXPGPATH. However, most of the places the core defines the path name to be MAXPGPATH + some bytes. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Вложения
Hi. Anyway, on second thought, lager picture than just adding the post-process-end callback would out of the scope of this patch. So I write some comments on the patch first, then discussion the rest. Thu, 20 Oct 2022 13:29:12 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in > > No. I didn't mean that, If server stops after a successfull > > durable_rename but before the next call to > > basic_archive_file_internal, that call back makes false comlaint since > > that temprary file is actually gone. > > Right. Fixed it. Thanks, but we don't need to wipe out the all bytes. Just putting \0 at the beginning of the buffer is sufficient. And the Memset() at the beginning of basic_archive_file_internal is not needed since that static variables are initially initialized to zeros. This is not necessarily needed, but it might be better we empty tempfilepath after unlinking the file. > > +static char tempfilepath[MAXPGPATH + 256]; > > > > MAXPGPATH is the maximum length of a file name that PG assumes to be > > able to handle. Thus extending that length seems wrong. > > I think it was to accommodate the temp file name - "archtemp", file, > MyProcPid, epoch, but I agree that it can just be MAXPGPATH. However, > most of the places the core defines the path name to be MAXPGPATH + > some bytes. Mmm. I found that basic_archive already does the same thing. So lets follow that in this patch. + expectation that a value will soon be provided. Care must be taken when + multiple servers are archiving to the same + <varname>basic_archive.archive_library</varname> directory as they all + might try to archive the same WAL file. I don't understand what kind of care should be taken by reading this.. Anyway the PID + millisecond-resolution timestamps work in the almost all cases, but it's not perfect. So.. I don't come up with what to think about this.. Traditionally we told people that "archiving should not overwrite a file unconditionally. Generally it is safe only when the contents are identical then should be errored-out otherwise.".. Ah this is. https://www.postgresql.org/docs/devel/continuous-archiving.html > Archive commands and libraries should generally be designed to > refuse to overwrite any pre-existing archive file. This is an > important safety feature to preserve the integrity of your archive > in case of administrator error (such as sending the output of two > different servers to the same archive directory). It is advisable to > test your proposed archive library to ensure that it does not > overwrite an existing file. ... > file again after restarting (provided archiving is still > enabled). When an archive command or library encounters a > pre-existing file, it should return a zero status or true, > respectively, if the WAL file has identical contents to the > pre-existing archive and the pre-existing archive is fully persisted > to storage. If a pre-existing file contains different contents than > the WAL file being archived, the archive command or library must > return a nonzero status or false, respectively. On the other hand, basic_archive seems to overwrite existing files unconditionally. I think this is not great, in that we offer a tool betrays to our own wrtten suggestion... The following is out-of-the-scope discussions. ================== At Thu, 20 Oct 2022 13:29:12 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in > On Thu, Oct 20, 2022 at 6:57 AM Kyotaro Horiguchi > <horikyota.ntt@gmail.com> wrote: > > > The archive module must be > responsible for cleaning up the temp file > > that it creates. One > way to do it is in the archive module's shutdown > > callback, this > covers most of the cases, but not all. > > True. But I agree to > Robert that such temporary files should be > cleanup-able without > needing temporarily-valid knowledge (current file > name, in this > case). A common strategy for this is to name those files > by names > that can be identifed as garbage. > > I'm not sure how we can distinguish temp files as garbage based on > name. As Robert pointed out upthread, using system identifier may not > help as the standbys share the same system identifier and it's > possible that they might archive to the same directory. Is there any > other way? Correct naming scheme would lead to resolution. > > But since power cut is a typical crash source, we need to identify > > ruined temporary files and the current naming convention is incomplete > > in this regard. > > Please note that basic_archive module creates one temp file at a time > to make file copying/moving atomic and it can keep track of the temp > file name and delete it using shutdown callback which helps in most of > the scenarios. As said upthread, repeated crashes while basic_archive > module is atomically copying files around is a big problem in itself > and basic_archive module need not worry about it much. I'm not sure. It's a "basic_archiver", but an "example_archiver". I read the name as "it is no highly configuratable but practically usable". In this criteria, clean up feature is not too much. > > there. However, I think we can identify truly rotten temp files by > > inserting host name or cluster name (means cluster_name in > > postgresql.conf) even in that case. This premise that DBA names every > > Well, these ideas are great! However, we can leave defining such > strategies to archive module implementors. IMO, the basich_archive > module ought to be as simple and elegant as possible yet showing up > the usability of archive modules feature. I don't understand why garbage-cleanup is not elegant. On the other hand, if we pursue minimalism about this tool, we don't need the timestamp part since this tool cannot write two or more files simultaneously by the same process. (I don't mean we shoud remove that part.) > > Of course, it premised that a cluster uses the same name for a > > segment. If we insert cluseter_name into the temprary name, a starting > > cluster can indentify garbage files to clean up. For example if we > > name them as follows. > > > > ARCHTEMP_cluster1_pid_time_<lsn> > > > > A starting cluster can clean up all files starts with > > "archtemp_cluster1_*". (We need to choose the delimiter carefully, > > though..) > > Postgres cleaning up basic_archive modules temp files at the start up > isn't a great idea IMO. Because these files are not related to server > functionality in any way unlike temp files removed in > RemovePgTempFiles(). IMO, we ought to keep the basic_archive module > simple. I think "init" feature is mandatory. But it would be another project. > > > Please see the attached v2 patch. > > > > +static char tempfilepath[MAXPGPATH + 256]; > > > > MAXPGPATH is the maximum length of a file name that PG assumes to be > > able to handle. Thus extending that length seems wrong. > > I think it was to accommodate the temp file name - "archtemp", file, > MyProcPid, epoch, but I agree that it can just be MAXPGPATH. However, > most of the places the core defines the path name to be MAXPGPATH + > some bytes. Oooh. I don't say "most" but some instances are found. (Almost all usage of MAXPGPATH are not accompanied by additional length). But it would be another issue. Anyway I don't think even if the use of over-sized path buffers is widely spread in our tree, it cannot be a reason for this new code need to do the same thing. If we follow that direction, the following code in basic_archive should have MAXPGPATH*2 wide, since it stores a "<directory>/<filename>" construct. (That usage is found in, e.g., dbsize.c, which of course I think stupid..) regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Sorry, the previous mail are sent inadvertently.. At Fri, 21 Oct 2022 14:13:46 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > + expectation that a value will soon be provided. Care must be taken when > + multiple servers are archiving to the same > + <varname>basic_archive.archive_library</varname> directory as they all > + might try to archive the same WAL file. > > I don't understand what kind of care should be taken by reading this.. > > Anyway the PID + millisecond-resolution timestamps work in the almost > all cases, but it's not perfect. So.. I don't come up with what to > think about this.. > > Traditionally we told people that "archiving should not overwrite a > file unconditionally. Generally it is safe only when the contents are > identical then should be errored-out otherwise.".. Ah this is. basic_archive follows the suggestion if the same file exists before it starts to write a file. So please forget this. > * Sync the temporary file to disk and move it to its final destination. > * Note that this will overwrite any existing file, but this is only > * possible if someone else created the file since the stat() above. I'm not sure why we are allowed to allow this behavior.. But it also would be another issue, if anyone cares. Thus I feel that we might not touch this description in this patch. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Fri, Oct 21, 2022 at 10:43 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > Thanks, but we don't need to wipe out the all bytes. Just putting \0 > at the beginning of the buffer is sufficient. Nah, that's not a clean way IMO. > And the Memset() at the > beginning of basic_archive_file_internal is not needed since that > static variables are initially initialized to zeros. Removed. MemSet() after durable_rename() would be sufficient. > This is not necessarily needed, but it might be better we empty > tempfilepath after unlinking the file. I think it's not necessary as the archiver itself is shutting down and I don't think the server calls the shutdown callback twice. However, if we want basic_archive_shutdown() to be more protective against multiple calls (for any reason that we're missing), we can have a static local variable to quickly exit if the callback is already called. instead of MemSet(), but that's not needed I guess. > + expectation that a value will soon be provided. Care must be taken when > + multiple servers are archiving to the same > + <varname>basic_archive.archive_library</varname> directory as they all > + might try to archive the same WAL file. > > I don't understand what kind of care should be taken by reading this.. It's just a notice, however I agree with you that it may be confusing. I've removed it. Please review the attached v4 patch. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Вложения
On Fri, Oct 21, 2022 at 09:30:16PM +0530, Bharath Rupireddy wrote: > On Fri, Oct 21, 2022 at 10:43 AM Kyotaro Horiguchi > <horikyota.ntt@gmail.com> wrote: >> Thanks, but we don't need to wipe out the all bytes. Just putting \0 >> at the beginning of the buffer is sufficient. > > Nah, that's not a clean way IMO. Why not? This is a commonly-used technique. I see over 80 existing useѕ in PostgreSQL. Plus, your shutdown callback only checks for the first byte, anyway. + if (tempfilepath[0] == '\0') + return; As noted upthread [0], I think we should be cautious to only remove the temporary file if we know we created it. I still feel that trying to add this cleanup logic to basic_archive is probably more trouble than it's worth, but if there is a safe and effective way to do so, I won't object. [0] https://postgr.es/m/20221015211026.GA1821022%40nathanxps13 -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Sun, Nov 6, 2022 at 3:17 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > > On Fri, Oct 21, 2022 at 09:30:16PM +0530, Bharath Rupireddy wrote: > > On Fri, Oct 21, 2022 at 10:43 AM Kyotaro Horiguchi > > <horikyota.ntt@gmail.com> wrote: > >> Thanks, but we don't need to wipe out the all bytes. Just putting \0 > >> at the beginning of the buffer is sufficient. > > > > Nah, that's not a clean way IMO. > > Why not? This is a commonly-used technique. I see over 80 existing useѕ > in PostgreSQL. Plus, your shutdown callback only checks for the first > byte, anyway. > > + if (tempfilepath[0] == '\0') > + return; The tempfile name can vary in size for the simple reason that a pid can be of varying digits - for instance, tempfile name is 'foo1234' (pid being 1234) and it becomes '\'\0\'oo1234' if we just reset the first char to '\0' and say pid wraparound occurred, now the it becomes 'bar5674' (pid being 567). BTW, I couldn't find the 80 existing instances, can you please let me know your search keyword? > As noted upthread [0], I think we should be cautious to only remove the > temporary file if we know we created it. I still feel that trying to add > this cleanup logic to basic_archive is probably more trouble than it's > worth, but if there is a safe and effective way to do so, I won't object. > > [0] https://postgr.es/m/20221015211026.GA1821022%40nathanxps13 So, IIUC, your point here is what if the copy_file fails to create the temp file when it already exists. With the name collision being a rare scenario, given the pid and timestamp variables, I'm not sure if copy_file can ever fail because the temp file already exists (with errno EEXIST). However, if we want to be extra-cautious, checking if temp file exists with file_exists() before calling copy_file() might help avoid such cases. If we don't want to have extra system call (via file_exists()) to check the temp file existence, we can think of sending a flag to copy_file(src, dst, &is_dst_file_created) and use is_dst_file_created in the shutdown callback. Thoughts? -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Mon, Nov 07, 2022 at 04:53:35PM +0530, Bharath Rupireddy wrote: > The tempfile name can vary in size for the simple reason that a pid > can be of varying digits - for instance, tempfile name is 'foo1234' > (pid being 1234) and it becomes '\'\0\'oo1234' if we just reset the > first char to '\0' and say pid wraparound occurred, now the it becomes > 'bar5674' (pid being 567). The call to snprintf() should take care of adding a terminating null byte in the right place. > BTW, I couldn't find the 80 existing instances, can you please let me > know your search keyword? grep "\[0\] = '\\\0'" src -r > So, IIUC, your point here is what if the copy_file fails to create the > temp file when it already exists. With the name collision being a rare > scenario, given the pid and timestamp variables, I'm not sure if > copy_file can ever fail because the temp file already exists (with > errno EEXIST). However, if we want to be extra-cautious, checking if > temp file exists with file_exists() before calling copy_file() might > help avoid such cases. If we don't want to have extra system call (via > file_exists()) to check the temp file existence, we can think of > sending a flag to copy_file(src, dst, &is_dst_file_created) and use > is_dst_file_created in the shutdown callback. Thoughts? Presently, if copy_file() encounters a pre-existing file, it should ERROR, which will be caught in the sigsetjmp() block in basic_archive_file(). The shutdown callback shouldn't run in this scenario. I think this cleanup logic should run in both the shutdown callback and the sigsetjmp() block, but it should only take action (i.e., deleting the leftover temporary file) if the ERROR or shutdown occurs after creating the file in copy_file() and before renaming the temporary file to its final destination. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Tue, Nov 8, 2022 at 3:18 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > > The call to snprintf() should take care of adding a terminating null byte > in the right place. Ah, my bad. MemSet is avoided in v5 patch setting only the first byte. > > So, IIUC, your point here is what if the copy_file fails to create the > > temp file when it already exists. With the name collision being a rare > > scenario, given the pid and timestamp variables, I'm not sure if > > copy_file can ever fail because the temp file already exists (with > > errno EEXIST). However, if we want to be extra-cautious, checking if > > temp file exists with file_exists() before calling copy_file() might > > help avoid such cases. If we don't want to have extra system call (via > > file_exists()) to check the temp file existence, we can think of > > sending a flag to copy_file(src, dst, &is_dst_file_created) and use > > is_dst_file_created in the shutdown callback. Thoughts? > > Presently, if copy_file() encounters a pre-existing file, it should ERROR, > which will be caught in the sigsetjmp() block in basic_archive_file(). The > shutdown callback shouldn't run in this scenario. Determining the "file already exists" error/EEXIST case from a bunch of other errors in copy_file() is tricky. However, I quickly hacked up copy_file() by adding elevel parameter, please see the attached v5-0001. > I think this cleanup logic should run in both the shutdown callback and the > sigsetjmp() block, but it should only take action (i.e., deleting the > leftover temporary file) if the ERROR or shutdown occurs after creating the > file in copy_file() and before renaming the temporary file to its final > destination. Please see attached v5 patch set. If the direction seems okay, I'll add a CF entry. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com