Discussion: Use durable_unlink for .ready and .done files for WAL segment removal
Hi all, While reviewing the archiving code, I have bumped into the fact that XLogArchiveCleanup() thinks that it is safe to do only a plain unlink() for .ready and .done files when removing a past segment. I don't think that it is a smart move, as on a subsequent crash we may still see those, but the related segment would have gone away. This is not really a problem for .done files, but it could confuse the archiver to see some .ready files about things that have already gone away. Attached is a patch. Thoughts? -- Michael
Attachments
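Michael's point hinges on what "durable" removal means: a plain unlink() only updates the parent directory entry, and that change may not have reached disk when the system crashes, so the status file can reappear after recovery. A minimal sketch of the idea behind durable_unlink() — remove the file, then fsync the parent directory — under the assumption of POSIX semantics; this is a simplified stand-in, not PostgreSQL's actual helper, which sits with the other durable_* routines and reports failures through ereport():

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/stat.h>

/* Crash-safe unlink sketch: remove the file, then fsync the parent
 * directory so the directory-entry removal itself reaches disk.
 * Returns 0 on success, -1 on error with errno set by the failing call. */
static int
durable_unlink_sketch(const char *path)
{
    char        dir[1024];
    char       *slash;
    int         fd;

    if (unlink(path) < 0)
        return -1;              /* caller can inspect errno, e.g. ENOENT */

    /* Derive the parent directory of the removed file. */
    strncpy(dir, path, sizeof(dir) - 1);
    dir[sizeof(dir) - 1] = '\0';
    slash = strrchr(dir, '/');
    if (slash != NULL)
        *slash = '\0';
    else
        strcpy(dir, ".");

    /* fsync the directory: this is what makes the unlink durable,
     * and it is the second, independent failure mode of the operation. */
    fd = open(dir, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fsync(fd) < 0)
    {
        close(fd);
        return -1;
    }
    close(fd);
    return 0;
}
```

Note the two distinct failure points (unlink and fsync) — this is exactly the property Kyotaro raises later in the thread, since a failure at the fsync stage leaves the file already gone, so a retry fails with ENOENT.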
Hi, On 2018-09-28 12:28:27 +0900, Michael Paquier wrote: > While reviewing the archiving code, I have bumped into the fact that > XLogArchiveCleanup() thinks that it is safe to do only a plain unlink() > for .ready and .done files when removing a past segment. I don't think > that it is a smart move, as on a subsequent crash we may still see > those, but the related segment would have gone away. This is not really > a problem for .done files, but it could confuse the archiver to see some > .ready files about things that have already gone away. Isn't that window fundamentally there anyway? - Andres
On Thu, Sep 27, 2018 at 08:40:26PM -0700, Andres Freund wrote: > On 2018-09-28 12:28:27 +0900, Michael Paquier wrote: >> While reviewing the archiving code, I have bumped into the fact that >> XLogArchiveCleanup() thinks that it is safe to do only a plain unlink() >> for .ready and .done files when removing a past segment. I don't think >> that it is a smart move, as on a subsequent crash we may still see >> those, but the related segment would have gone away. This is not really >> a problem for .done files, but it could confuse the archiver to see some >> .ready files about things that have already gone away. > > Isn't that window fundamentally there anyway? Sure. However the point I would like to make is that if we have the possibility to reduce this window, then we should. -- Michael
Attachments
On September 27, 2018 10:23:31 PM PDT, Michael Paquier <michael@paquier.xyz> wrote: >On Thu, Sep 27, 2018 at 08:40:26PM -0700, Andres Freund wrote: >> On 2018-09-28 12:28:27 +0900, Michael Paquier wrote: >>> While reviewing the archiving code, I have bumped into the fact that >>> XLogArchiveCleanup() thinks that it is safe to do only a plain >unlink() >>> for .ready and .done files when removing a past segment. I don't >think >>> that it is a smart move, as on a subsequent crash we may still see >>> those, but the related segment would have gone away. This is not >really >>> a problem for .done files, but it could confuse the archiver to see >some >>> .ready files about things that have already gone away. >> >> Isn't that window fundamentally there anyway? > >Sure. However the point I would like to make is that if we have the >possibility to reduce this window, then we should. It's not free though. I don't think this is as clear cut as you make it sound. Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Greetings, * Michael Paquier (michael@paquier.xyz) wrote: > While reviewing the archiving code, I have bumped into the fact that > XLogArchiveCleanup() thinks that it is safe to do only a plain unlink() > for .ready and .done files when removing a past segment. I don't think > that it is a smart move, as on a subsequent crash we may still see > those, but the related segment would have gone away. This is not really > a problem for .done files, but it could confuse the archiver to see some > .ready files about things that have already gone away. Is there an issue with making the archiver able to understand that situation instead of being confused by it..? Seems like that'd probably be a good thing to do regardless of this, but that would then remove the need for this kind of change.. Thanks! Stephen
Attachments
On Fri, Sep 28, 2018 at 02:36:19PM -0400, Stephen Frost wrote: > Is there an issue with making the archiver able to understand that > situation instead of being confused by it..? Seems like that'd probably > be a good thing to do regardless of this, but that would then remove the > need for this kind of change.. I thought about that a bit, and there is also a lot which can be done within the archive_command itself regarding that, so I am not sure there is an argument for making pgarch.c more complicated than it needs to be. Now it is true that for most users, having a .ready file but no segment would most likely lead to a failure. I suspect that a large user base is still just using plain cp in archive_command, which would cause the archiver to be stuck. So we could actually just tweak pgarch_readyXlog to check if the segment fetched actually exists (see the bottom of said function). If it doesn't, then the archiver removes the .ready file and retries fetching a new segment. -- Michael
Attachments
Greetings, * Michael Paquier (michael@paquier.xyz) wrote: > On Fri, Sep 28, 2018 at 02:36:19PM -0400, Stephen Frost wrote: > > Is there an issue with making the archiver able to understand that > > situation instead of being confused by it..? Seems like that'd probably > > be a good thing to do regardless of this, but that would then remove the > > need for this kind of change.. > > I thought about that a bit, and there is also a lot which can be done > within the archive_command itself regarding that, so I am not sure there > is an argument for making pgarch.c more complicated than it needs to be. > Now it is true that for most users, having a .ready file but no segment > would most likely lead to a failure. I suspect that a large user base > is still just using plain cp in archive_command, which would cause the > archiver to be stuck. So we could actually just tweak pgarch_readyXlog > to check if the segment fetched actually exists (see the bottom of the > said function). If it doesn't, then the archiver removes the .ready > file and retries fetching a new segment. Yes, checking if the WAL file exists before calling archive_command on it is what I was thinking we'd do here, and if it doesn't, then just remove the .ready file. An alternative would be to go through the .ready files on crash recovery and remove any .ready files that don't have corresponding WAL files, or if we felt that it was necessary, we could do that on every restart, but do we really think we'd need to do that..? Thanks! Stephen
Attachments
On Fri, Sep 28, 2018 at 07:16:25PM -0400, Stephen Frost wrote: > An alternative would be to go through the .ready files on crash-recovery > and remove any .ready files that don't have corresponding WAL files, or > if we felt that it was necessary, we could do that on every restart but > do we really think we'd need to do that..? Actually, what you are proposing here sounds much better to me. That's in the area of what has been done recently with RemoveTempXlogFiles() in 5fc1008e. Any objections to doing something like that? -- Michael
Attachments
On Sat, Sep 29, 2018 at 04:58:57PM +0900, Michael Paquier wrote: > Actually, what you are proposing here sounds much better to me. That's > in the area of what has been done recently with RemoveTempXlogFiles() in > 5fc1008e. Any objections to doing something like that? Okay. I have hacked a patch based on Stephen's idea as attached. Any opinions? -- Michael
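The crash-recovery cleanup Stephen describes boils down to a directory scan: for every ".ready" file in archive_status, check whether the matching WAL segment still exists, and drop the status file only when it provably does not. A hedged sketch in plain POSIX calls with hypothetical paths — the real patch would sit in xlog.c next to RemoveTempXlogFiles() and use PostgreSQL's directory and logging primitives instead:

```c
#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/stat.h>

/* Remove ".ready" status files whose WAL segment no longer exists.
 * Returns the number of orphan files removed, or -1 if the status
 * directory cannot be opened.  Sketch only. */
static int
remove_orphan_ready_files(const char *wal_dir, const char *status_dir)
{
    DIR        *dir;
    struct dirent *de;
    int         removed = 0;

    if ((dir = opendir(status_dir)) == NULL)
        return -1;
    while ((de = readdir(dir)) != NULL)
    {
        char        seg_path[1024];
        char        ready_path[1024];
        struct stat st;
        size_t      len = strlen(de->d_name);

        /* only look at "<segment>.ready" entries */
        if (len <= 6 || strcmp(de->d_name + len - 6, ".ready") != 0)
            continue;
        /* the segment name is the status file name minus ".ready" */
        snprintf(seg_path, sizeof(seg_path), "%s/%.*s",
                 wal_dir, (int) (len - 6), de->d_name);
        /* keep the status file unless the segment is provably gone */
        if (stat(seg_path, &st) == 0 || errno != ENOENT)
            continue;
        snprintf(ready_path, sizeof(ready_path), "%s/%s",
                 status_dir, de->d_name);
        if (unlink(ready_path) == 0)
            removed++;
    }
    closedir(dir);
    return removed;
}
```

This is also where Nathan's later concern about startup time comes from: the scan cost is proportional to the number of entries in archive_status.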
Attachments
One argument for instead checking WAL file existence before calling archive_command might be to avoid the increased startup time. Granted, any added delay from this patch is unlikely to be noticeable unless your archiver is way behind and archive_status has a huge number of files. However, I have seen cases where startup is stuck on other tasks like SyncDataDirectory() and RemovePgTempFiles() for a very long time, so perhaps it is worth considering. Nathan
At Fri, 02 Nov 2018 14:47:08 +0000, Nathan Bossart <bossartn@amazon.com> wrote in <154117002849.5569.14588306221618961668.pgcf@coridan.postgresql.org> > One argument for instead checking WAL file existence before calling > archive_command might be to avoid the increased startup time. > Granted, any added delay from this patch is unlikely to be noticeable > unless your archiver is way behind and archive_status has a huge > number of files. However, I have seen cases where startup is stuck on > other tasks like SyncDataDirectory() and RemovePgTempFiles() for a > very long time, so perhaps it is worth considering. While archive_mode is turned on, .ready files are created for all existing WAL files if they do not exist. Thus the archiver may wait for the earliest segment to have a .ready file. As a result, pgarch_readyXLog can be modified to loop over WAL files, not status files. This prevents the confusion that comes from .ready files for non-existent segment files. RemoveXlogFile as it is doesn't get confused by .done files for nonexistent segments. We may leave useless .done/.ready files, but we would no longer scan over them, so it would not matter how many files are in the directory. The remaining issue is removal of the files. Even if we blew away the directory altogether, status files would be cleanly recreated, though already-archived WAL segments would be archived again. However, redundant copies won't happen with our recommended configuration :p # Yeah, I see almost all sites use simple 'cp' or 'scp' for that.. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Thu, Nov 15, 2018 at 07:39:27PM +0900, Kyotaro HORIGUCHI wrote: > At Fri, 02 Nov 2018 14:47:08 +0000, Nathan Bossart > <bossartn@amazon.com> wrote in > <154117002849.5569.14588306221618961668.pgcf@coridan.postgresql.org>: >> One argument for instead checking WAL file existence before calling >> archive_command might be to avoid the increased startup time. I guess that you mean the startup of the archive command itself here. Yes, that can be an issue with a high WAL output, depending on the interpreter of the archive command :( >> Granted, any added delay from this patch is unlikely to be noticeable >> unless your archiver is way behind and archive_status has a huge >> number of files. However, I have seen cases where startup is stuck on >> other tasks like SyncDataDirectory() and RemovePgTempFiles() for a >> very long time, so perhaps it is worth considering. What's the scale of the pg_wal partition and the amount of time things were stuck? I would imagine that the sync phase hurts the most, and a fast startup time for crash recovery is always important. > While archive_mode is turned on, .ready files are created for all > existing WAL files if they do not exist. Thus the archiver may wait > for the earliest segment to have a .ready file. Yes, RemoveOldXlogFiles() does that via XLogArchiveCheckDone(). > As a result, pgarch_readyXLog can be modified to loop over WAL files, > not status files. This prevents the confusion that comes from .ready > files for non-existent segment files. No, pgarch_readyXLog() should still look at .ready files as those are here for this purpose, but we could have an additional check to see if the segment linked with it actually exists and can be archived. This check could happen in pgarch.c before the archive command gets called (just before pgarch_ArchiverCopyLoop and after XLogArchiveCommandSet feels rather right, and it should be cheap enough to call stat()). -- Michael
Attachments
On Thu, Nov 22, 2018 at 01:16:09PM +0900, Michael Paquier wrote: > No, pgarch_readyXLog() should still look after .ready files as those are > here for this purpose, but we could have an additional check to see if > the segment linked with it actually exists and can be archived. This > check could happen in pgarch.c code before calling the archive command > gets called (just before pgarch_ArchiverCopyLoop and after > XLogArchiveCommandSet feels rather right, and that it should be cheap > enough to call stat()). s/pgarch_ArchiverCopyLoop/pgarch_archiveXlog/. Attached is a patch shaped based on the idea upthread. Thoughts? -- Michael
Attachments
On 11/21/18, 10:16 PM, "Michael Paquier" <michael@paquier.xyz> wrote: >> At Fri, 02 Nov 2018 14:47:08 +0000, Nathan Bossart >> <bossartn@amazon.com> wrote in >> <154117002849.5569.14588306221618961668.pgcf@coridan.postgresql.org>: >>> Granted, any added delay from this patch is unlikely to be noticeable >>> unless your archiver is way behind and archive_status has a huge >>> number of files. However, I have seen cases where startup is stuck on >>> other tasks like SyncDataDirectory() and RemovePgTempFiles() for a >>> very long time, so perhaps it is worth considering. > > What's the scale of the pg_wal partition and the amount of time things > were stuck? I would imagine that the sync phase hurts the most, and a > fast startup time for crash recovery is always important. I don't have exact figures to share, but yes, a huge number of calls to sync_file_range() and fsync() can use up a lot of time. Presumably Postgres processes files individually instead of using sync() because sync() may return before writing is done. Also, sync() would affect non-Postgres files. However, it looks like Linux actually does wait for writing to complete before returning from sync() [0]. For RemovePgTempFiles(), the documentation above the function indicates that skipping temp file cleanup during startup would actually be okay because collisions with existing temp file names should be handled by OpenTemporaryFile(). I assume this cleanup is done during startup because there isn't a great alternative besides offloading the work to a new background worker or something. On 11/27/18, 6:35 AM, "Michael Paquier" <michael@paquier.xyz> wrote: > Attached is a patch shaped based on the idea upthread. > Thoughts? I took a look at this patch.

+        /*
+         * In the event of a system crash, archive status files may be
+         * left behind as their removals are not durable, cleaning up
+         * orphan entries here is the cheapest method. So check that
+         * the segment trying to be archived still exists.
+         */
+        snprintf(pathname, MAXPGPATH, XLOGDIR "/%s", xlog);
+        if (stat(pathname, &stat_buf) != 0)
+        {

Don't we also need to check that errno is ENOENT here?

+            StatusFilePath(xlogready, xlog, ".ready");
+            if (durable_unlink(xlogready, WARNING) == 0)
+                ereport(WARNING,
+                        (errmsg("removed orphan archive status file %s",
+                                xlogready)));
+            return;

IIUC any time that the file does not exist, we will attempt to unlink it. Regardless of whether unlinking fails or succeeds, we then proceed to give up archiving for now, but it's not clear why. Perhaps we should retry unlinking a number of times (like we do for pgarch_archiveXlog()) when durable_unlink() fails and simply "break" to move on to the next .ready file if durable_unlink() succeeds. Nathan [0] http://man7.org/linux/man-pages/man2/sync.2.html
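Nathan's ENOENT question matters because stat() can fail for reasons other than a missing file, and dropping a .ready file on a transient error (EACCES, EIO, ...) would silently lose a segment from the archive. A small illustrative helper — hypothetical name, not code from the patch, which keeps the check inline — makes the distinction explicit:

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>

typedef enum SegState
{
    SEG_PRESENT,                /* segment file exists, archive it */
    SEG_MISSING,                /* stat() failed with ENOENT: orphan .ready */
    SEG_UNKNOWN                 /* any other stat() failure: keep the .ready */
} SegState;

/* Only a hard ENOENT justifies removing the status file; any other
 * stat() failure could be transient, so the .ready file must survive. */
static SegState
check_segment(const char *path)
{
    struct stat st;

    if (stat(path, &st) == 0)
        return SEG_PRESENT;
    return (errno == ENOENT) ? SEG_MISSING : SEG_UNKNOWN;
}
```

For example, a path whose intermediate component is a regular file fails with ENOTDIR rather than ENOENT, and must not be treated as an orphan.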
Hi, On 2018-11-27 20:43:06 +0000, Bossart, Nathan wrote: > I don't have exact figures to share, but yes, a huge number of calls > to sync_file_range() and fsync() can use up a lot of time. Presumably > Postgres processes files individually instead of using sync() because > sync() may return before writing is done. Also, sync() would affect > non-Postgres files. However, it looks like Linux actually does wait > for writing to complete before returning from sync() [0]. sync() has absolutely no way to report errors. So, we're never going to be able to use it. Besides, even postgres' temp files would be a good reason to not use it. Greetings, Andres Freund
On 11/27/18, 2:46 PM, "Andres Freund" <andres@anarazel.de> wrote: > On 2018-11-27 20:43:06 +0000, Bossart, Nathan wrote: >> I don't have exact figures to share, but yes, a huge number of calls >> to sync_file_range() and fsync() can use up a lot of time. Presumably >> Postgres processes files individually instead of using sync() because >> sync() may return before writing is done. Also, sync() would affect >> non-Postgres files. However, it looks like Linux actually does wait >> for writing to complete before returning from sync() [0]. > > sync() has absolutely no way to report errors. So, we're never going to > be able to use it. Besides, even postgres' temp files would be a good > reason to not use it. Ah, I see. Thanks for clarifying. Nathan
On Tue, Nov 27, 2018 at 08:43:06PM +0000, Bossart, Nathan wrote: > Don't we also need to check that errno is ENOENT here? Yep. > IIUC any time that the file does not exist, we will attempt to unlink > it. Regardless of whether unlinking fails or succeeds, we then > proceed to give up archiving for now, but it's not clear why. Perhaps > we should retry unlinking a number of times (like we do for > pgarch_archiveXlog()) when durable_unlink() fails and simply "break" > to move on to the next .ready file if durable_unlink() succeeds. Both suggestions sound reasonable to me. (durable_unlink is not called on HEAD in pgarch_archiveXlog). How about 3 retries with an in-between wait of 1s? That's consistent with what pgarch_ArchiverCopyLoop does, though I am not completely sure whether we actually want to be consistent for the purpose of removing orphaned ready files. -- Michael
Attachments
On 11/27/18, 3:20 PM, "Michael Paquier" <michael@paquier.xyz> wrote: > On Tue, Nov 27, 2018 at 08:43:06PM +0000, Bossart, Nathan wrote: >> IIUC any time that the file does not exist, we will attempt to unlink >> it. Regardless of whether unlinking fails or succeeds, we then >> proceed to give up archiving for now, but it's not clear why. Perhaps >> we should retry unlinking a number of times (like we do for >> pgarch_archiveXlog()) when durable_unlink() fails and simply "break" >> to move on to the next .ready file if durable_unlink() succeeds. > > Both suggestions sound reasonable to me. (durable_unlink is not called > on HEAD in pgarch_archiveXlog). How about 3 retries with a in-between > wait of 1s? That's consistent with what pgarch_ArchiverCopyLoop does, > still I am not completely sure if we actually want to be consistent for > the purpose of removing orphaned ready files. That sounds good to me. I was actually thinking of using the same retry counter that we use for pgarch_archiveXlog(), but on second thought, it is probably better to have two independent retry counters for these two unrelated operations. Nathan
On Tue, Nov 27, 2018 at 09:49:29PM +0000, Bossart, Nathan wrote: > That sounds good to me. I was actually thinking of using the same > retry counter that we use for pgarch_archiveXlog(), but on second > thought, it is probably better to have two independent retry counters > for these two unrelated operations. What I had in mind was two different variables if what I wrote was unclear, possibly with the same value, as archiving failure and failure with orphan file removals are two different concepts. -- Michael
Attachments
On 11/27/18, 3:53 PM, "Michael Paquier" <michael@paquier.xyz> wrote: > On Tue, Nov 27, 2018 at 09:49:29PM +0000, Bossart, Nathan wrote: >> That sounds good to me. I was actually thinking of using the same >> retry counter that we use for pgarch_archiveXlog(), but on second >> thought, it is probably better to have two independent retry counters >> for these two unrelated operations. > > What I had in mind was two different variables if what I wrote was > unclear, possibly with the same value, as archiving failure and failure > with orphan file removals are two different concepts. +1 Nathan
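The retry scheme being agreed on here could be sketched as follows — a hypothetical standalone helper with plain unlink() standing in for the durable_unlink() call under discussion, whereas the actual patch keeps the loop and its separate failure counter inline in pgarch.c:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/stat.h>

#define NUM_ORPHAN_RETRIES 3    /* same value as the archiving retries,
                                 * but tracked by an independent counter */

/* Try a few times to drop an orphaned .ready file before the archiver
 * gives up and revisits it on a later scan.  A file that is already
 * gone (ENOENT) counts as success, so the archiver moves on. */
static bool
remove_orphan_status(const char *ready_path)
{
    for (int failures = 0; failures < NUM_ORPHAN_RETRIES; failures++)
    {
        if (unlink(ready_path) == 0 || errno == ENOENT)
            return true;        /* gone; move on to the next .ready file */
        sleep(1);               /* brief pause before retrying, mirroring
                                 * pgarch_ArchiverCopyLoop's behavior */
    }
    return false;               /* give up for now, retry on a later cycle */
}
```

Treating ENOENT as success is what avoids the retry storm Kyotaro later points out: after a failure at durable_unlink()'s fsync stage, the file itself is already unlinked.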
On Thu, Nov 29, 2018 at 03:00:42PM +0000, Bossart, Nathan wrote: > +1 Okay, here is an updated patch for this stuff, which does the following: - Check for a WAL segment if it has a ".ready" status file, an orphaned status file is removed only on ENOENT. - If durable_unlink fails, retry 3 times. If there are too many failures, the archiver gives up on the orphan status file removal. If the removal works correctly, the archiver moves on to the next file. (The variable names could be better.) -- Michael
Attachments
On 12/4/18, 1:36 AM, "Michael Paquier" <michael@paquier.xyz> wrote: > Okay, here is an updated patch for this stuff, which does the following: > - Check for a WAL segment if it has a ".ready" status file, an orphaned > status file is removed only on ENOENT. > - If durable_unlink fails, retry 3 times. If there are too many > failures, the archiver gives up on the orphan status file removal. If > the removal works correctly, the archiver moves on to the next file. Thanks for the updated patch! The code looks good to me, the patch applies cleanly and builds without warnings, and it seems to work well in my manual tests. I just have a few wording suggestions.

+        /*
+         * In the event of a system crash, archive status files may be
+         * left behind as their removals are not durable, cleaning up
+         * orphan entries here is the cheapest method. So check that
+         * the segment trying to be archived still exists. If it does
+         * not, its orphan status file is removed.
+         */

I would phrase this comment this way: Since archive_status files are not durably removed, a system crash could leave behind .ready files for WAL segments that have already been recycled or removed. In this case, simply remove the orphan status file and move on.

+                ereport(WARNING,
+                        (errmsg("removed orphan archive status file %s",
+                                xlogready)));

I think we should put quotes around the file name like we do elsewhere in pgarch_ArchiverCopyLoop().

+                ereport(WARNING,
+                        (errmsg("failed removal of \"%s\" too many times, will try again later",
+                                xlogready)));

I'd suggest mirroring the log statement for failed archiving commands and saying something like, "removing orphan archive status file \"%s\" failed too many times, will try again later." IMO that makes it clearer what is failing and why we are removing it in the first place. Nathan
On Tue, Dec 04, 2018 at 08:40:40PM +0000, Bossart, Nathan wrote: > Thanks for the updated patch! The code looks good to me, the patch > applies cleanly and builds without warnings, and it seems to work well > in my manual tests. I just have a few wording suggestions. How are you testing this? I just stop the server and manually touch some fake status files in archive_status :) > I would phrase this comment this way: > > Since archive_status files are not durably removed, a system > crash could leave behind .ready files for WAL segments that > have already been recycled or removed. In this case, simply > remove the orphan status file and move on. Fine for me. Thanks! > + ereport(WARNING, > + (errmsg("removed orphan archive status file %s", > + xlogready))); > > I think we should put quotes around the file name like we do elsewhere > in pgarch_ArchiverCopyLoop(). Done. > + ereport(WARNING, > + (errmsg("failed removal of \"%s\" too many times, will try again later", > + xlogready))); > > I'd suggest mirroring the log statement for failed archiving commands > and saying something like, "removing orphan archive status file \"%s\" > failed too many times, will try again later." IMO that makes it > clearer what is failing and why we are removing it in the first place. "removal of" is more consistent here I think, so changed this way with your wording merged. -- Michael
Attachments
On 12/4/18, 7:35 PM, "Michael Paquier" <michael@paquier.xyz> wrote: > On Tue, Dec 04, 2018 at 08:40:40PM +0000, Bossart, Nathan wrote: >> Thanks for the updated patch! The code looks good to me, the patch >> applies cleanly and builds without warnings, and it seems to work well >> in my manual tests. I just have a few wording suggestions. > > How are you testing this? I just stop the server and manually touch > some fake status files in archive_status :) That's almost exactly what I was doing, too. >> I would phrase this comment this way: >> >> Since archive_status files are not durably removed, a system >> crash could leave behind .ready files for WAL segments that >> have already been recycled or removed. In this case, simply >> remove the orphan status file and move on. > > Fine for me. Thanks! > >> + ereport(WARNING, >> + (errmsg("removed orphan archive status file %s", >> + xlogready))); >> >> I think we should put quotes around the file name like we do elsewhere >> in pgarch_ArchiverCopyLoop(). > > Done. > >> + ereport(WARNING, >> + (errmsg("failed removal of \"%s\" too many times, will try again later", >> + xlogready))); >> >> I'd suggest mirroring the log statement for failed archiving commands >> and saying something like, "removing orphan archive status file \"%s\" >> failed too many times, will try again later." IMO that makes it >> clearer what is failing and why we are removing it in the first place. > > "removal of" is more consistent here I think, so changed this way with > your wording merged. The v4 patch looks good to me! Nathan
Hello. At Wed, 5 Dec 2018 16:11:23 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in <DB47EBD7-7291-4C39-9F8F-BE42E5193BD5@amazon.com> > The v4 patch looks good to me! durable_unlink has two modes of failure: failure to unlink and failure to fsync. If it once fails at the fsync stage, subsequent durable_unlink calls for the same file always fail to unlink with ENOENT. If durable_unlink is intended to be called repeatedly on failure, perhaps it should return a different code for each failure so that the caller can identify what to do next. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Thu, Dec 06, 2018 at 01:55:46PM +0900, Kyotaro HORIGUCHI wrote: > durable_unlink has two modes of failure: failure to unlink and > failure to fsync. If it once fails at the fsync stage, subsequent > durable_unlink calls for the same file always fail to unlink with > ENOENT. If durable_unlink is intended to be called repeatedly on > failure, perhaps it should return a different code for each > failure so that the caller can identify what to do next. Why? A WARNING would be logged if the first unlink() fails, and another, different WARNING would be logged if the subsequent fsync fails. That looks like enough to make a distinction between both. Now, you may have a point that we could also live with only using unlink() for this code path, as even on repetitive crashes this would take care of removing orphan archive status files consistently. -- Michael
Attachments
On Thu, Dec 06, 2018 at 02:43:35PM +0900, Michael Paquier wrote: > Why? A WARNING would be logged if the first unlink() fails, and > another, different WARNING would be logged if the subsequent fsync > fails. That looks like enough to make a distinction between both. Now, > you may have a point that we could also live with only using > unlink() for this code path, as even on repetitive crashes this would > take care of removing orphan archive status files consistently. After sleeping on that, using plain unlink() indeed makes the most sense. Any objections if I move on with that, adding a proper comment explaining the choice? I don't plan on wrapping this patch before Monday my time anyway. -- Michael
Attachments
On 12/6/18, 4:54 PM, "Michael Paquier" <michael@paquier.xyz> wrote: > On Thu, Dec 06, 2018 at 02:43:35PM +0900, Michael Paquier wrote: >> Why? A WARNING would be logged if the first unlink() fails, and >> another, different WARNING would be logged if the subsequent fsync >> fails. It looks enough to me to make a distinction between both. Now, >> you may have a point in the fact that we could also live with only using >> unlink() for this code path, as even on repetitive crashes this would >> take care of removing orphan archive status files consistently. > > After sleeping on that, using plain unlink() makes indeed the most > sense. Any objections if I move on with that, adding a proper comment > explaining the choice? I don't plan to finish wrapping this patch today > but Monday my time anyway. That seems reasonable to me. Nathan
On Thu, Dec 06, 2018 at 11:18:12PM +0000, Bossart, Nathan wrote: > That seems reasonable to me. Thanks, committed after renaming a couple of variables, after adding a comment about the use of unlink() and also adding a comment explaining what the different retry counters are here for. -- Michael