Discussion: Re: BUG #15346: Replica fails to start after the crash
Hello hackers!

It seems bgwriter running on the replicas is broken by commit 8d68ee6, and as a result bgwriter never updates minRecoveryPoint in pg_control. Please see a detailed explanation below.

2018-08-29 22:54 GMT+02:00 Michael Paquier <michael@paquier.xyz>:
> This is not a solution in my opinion, as you could invalidate activities
> of backends connected to the database when the incorrect consistent
> point allows connections to come in too early.

That's true, but I still think it is better than aborting the startup process...

> What happens if you replay with hot_standby = on up to the latest point,
> without any concurrent connections, then issue a checkpoint on the
> standby once you got to a point newer than the complain, and finally
> restart the standby with the bgworker?
>
> Another idea I have would be to make the standby promote, issue a
> checkpoint on it, and then use pg_rewind as a trick to update the
> control file to a point newer than the inconsistency. As PG < 9.6.10
> could make the minimum recovery point go backwards, applying the upgrade
> after the consistent point got to an incorrect state would trigger the
> failure.

Well, all these actions probably help to relieve symptoms and replay WAL up to the point where it becomes really consistent.

I spent a long time trying to figure out how it could happen that some pages were written to disk while pg_control wasn't updated, and I think I managed to put all the pieces of the puzzle into a nice picture:

static void
UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force)
{
	/* Quick check using our local copy of the variable */
	if (!updateMinRecoveryPoint || (!force && lsn <= minRecoveryPoint))
		return;

	/*
	 * An invalid minRecoveryPoint means that we need to recover all the WAL,
	 * i.e., we're doing crash recovery.  We never modify the control file's
	 * value in that case, so we can short-circuit future checks here too.  The
	 * local values of minRecoveryPoint and minRecoveryPointTLI should not be
	 * updated until crash recovery finishes.
	 */
	if (XLogRecPtrIsInvalid(minRecoveryPoint))
	{
		updateMinRecoveryPoint = false;
		return;
	}

This code was changed in commit 8d68ee6, and it broke bgwriter. Now bgwriter never updates pg_control when it flushes dirty pages to disk.

How it happens: when bgwriter starts, minRecoveryPoint is not initialized, and if I attach with gdb, it shows that the value of minRecoveryPoint is 0, i.e. invalid. As a result, updateMinRecoveryPoint is set to false, and on the next call of UpdateMinRecoveryPoint from bgwriter the function returns right after the very first if. Bgwriter itself never changes updateMinRecoveryPoint back to true, and boom: we can get a lot of pages written to disk, but minRecoveryPoint in pg_control won't be updated! If the replica happens to crash in such conditions, it reaches consistency much earlier than it should!

Regards,
--
Alexander Kukushkin
On Thu, Aug 30, 2018 at 10:55:23AM +0200, Alexander Kukushkin wrote:
> Bgwriter itself never changes updateMinRecoveryPoint to true and boom,
> we can get a lot of pages written to disk, but minRecoveryPoint in the
> pg_control won't be updated!

That's indeed obvious by reading the code.  The bgwriter would be started only once a consistent point has been reached, so the startup process would normally have already updated the control file to the consistent point.  Something like the attached should take care of the problem.  As the updates of the local copy of minRecoveryPoint strongly rely on whether the startup process is used, I think that we should use InRecovery for the sanity checks.

I'd like to also add a TAP test for that, which should be easy enough if we do sanity checks by looking at the output of the control file.  I'll try to put more thought into that.

Does it take care of the problem?
--
Michael
Attachments
Hi,

2018-08-30 15:39 GMT+02:00 Michael Paquier <michael@paquier.xyz>:
> That's indeed obvious by reading the code.  The bgwriter would be
> started only once a consistent point has been reached, so the startup
> process would have normally already updated the control file to the
> consistent point.  Something like the attached should take care of the
> problem.  As the updates of the local copy of minRecoveryPoint strongly
> rely on if the startup process is used, I think that we should use
> InRecovery for the sanity checks.
>
> I'd like to also add a TAP test for that, which should be easy enough if
> we do sanity checks by looking up at the output of the control file.
> I'll try to put more thoughts on that.
>
> Does it take care of the problem?

Yep, with the patch applied bgwriter acts as expected!

Breakpoint 1, UpdateControlFile () at xlog.c:4536
4536            INIT_CRC32C(ControlFile->crc);
(gdb) bt
#0  UpdateControlFile () at xlog.c:4536
#1  0x00005646d071ddb2 in UpdateMinRecoveryPoint (lsn=26341965784, force=0 '\000') at xlog.c:2597
#2  0x00005646d071de65 in XLogFlush (record=26341965784) at xlog.c:2632
#3  0x00005646d09d693a in FlushBuffer (buf=0x7f8e1ca523c0, reln=0x5646d2e86028) at bufmgr.c:2729
#4  0x00005646d09d63d6 in SyncOneBuffer (buf_id=99693, skip_recently_used=1 '\001', wb_context=0x7ffd07757380) at bufmgr.c:2394
#5  0x00005646d09d6172 in BgBufferSync (wb_context=0x7ffd07757380) at bufmgr.c:2270
#6  0x00005646d097c266 in BackgroundWriterMain () at bgwriter.c:279
#7  0x00005646d073b38c in AuxiliaryProcessMain (argc=2, argv=0x7ffd07758840) at bootstrap.c:424
#8  0x00005646d098dc4a in StartChildProcess (type=BgWriterProcess) at postmaster.c:5300
#9  0x00005646d098d672 in sigusr1_handler (postgres_signal_arg=10) at postmaster.c:4999
#10 <signal handler called>
#11 0x00007f8e5f5a6ff7 in __GI___select (nfds=5, readfds=0x7ffd07759060, writefds=0x0, exceptfds=0x0, timeout=0x7ffd07758fd0) at ../sysdeps/unix/sysv/linux/select.c:41
#12 0x00005646d09890ca in ServerLoop () at postmaster.c:1685
#13 0x00005646d0988799 in PostmasterMain (argc=17, argv=0x5646d2e53390) at postmaster.c:1329
#14 0x00005646d08d2880 in main (argc=17, argv=0x5646d2e53390) at main.c:228

Regards,
--
Alexander Kukushkin
On Thu, Aug 30, 2018 at 04:03:43PM +0200, Alexander Kukushkin wrote:
> 2018-08-30 15:39 GMT+02:00 Michael Paquier <michael@paquier.xyz>:
>> Does it take care of the problem?
>
> Yep, with the patch applied bgwriter acts as expected!

Thanks for double-checking.  I have been struggling for a couple of hours to get a deterministic test case out of my pocket, and I did not get one, as you would need to get the bgwriter to flush a page before crash recovery finishes.  We could do that easily with a dedicated bgworker, but pg_ctl start expects the standby to finish recovery first, which makes it hard to trigger reliably, particularly on slow machines.

Anyway, I did more code review, and I think that I found another issue with XLogNeedsFlush(), which could force updateMinRecoveryPoint to false if called before XLogFlush() during crash recovery from a process other than the startup process, so if it got called before XLogFlush() we'd still have the same issue for a process doing both operations.  Hence, I have come up with the attached, which actually brings back the code to what it was before 8d68ee6 for those routines, except that we have fast-exit paths for the startup process, so that it is still able to replay all WAL available and avoid page reference issues post-promotion, deciding when to update its own copy of minRecoveryPoint when it finishes crash recovery.  This also saves a couple of locks on the control file from the startup process.

If you apply the patch and try it on your standby, are you able to get things up and working?
--
Michael
Attachments
2018-08-30 19:34 GMT+02:00 Michael Paquier <michael@paquier.xyz>:
> I have been struggling for a couple of hours to get a deterministic test
> case out of my pocket, and I did not get one as you would need to get
> the bgwriter to flush a page before crash recovery finishes, we could do

In my case the active standby server had crashed; it wasn't in crash recovery mode.

> the time, particularly for slow machines.  Anyway, I did more code
> review and I think that I found another issue with XLogNeedsFlush(),
> which could enforce updateMinRecoveryPoint to false if called before
> XLogFlush during crash recovery from another process than the startup
> process, so if it got called before XLogFlush() we'd still have the same
> issue for a process doing both operations.  Hence, I have come up with

At least XLogNeedsFlush() is called from just a couple of places and doesn't break bgwriter, but anyway, thanks for finding it.

> the attached, which actually brings back the code to what it was before
> 8d68ee6 for those routines, except that we have fast-exit paths for the
> startup process so as it is still able to replay all WAL available and
> avoid page reference issues post-promotion, deciding when to update its
> own copy of minRecoveryPoint when it finishes crash recovery.  This also
> saves from a couple of locks on the control file from the startup
> process.

Sounds good.

> If you apply the patch and try it on your standby, are you able to get
> things up and working?

Nope, minRecoveryPoint in pg_control is still wrong, and therefore startup still aborts at the same place if there are connections open.  I think there is no way to fix it other than letting it replay a sufficient amount of WAL without open connections.

Just judging from the timestamps of the WAL files in pg_xlog, it is obvious that a few moments before the hardware crashed postgres was replaying 0000000500000AB300000057, because the next file has a smaller mtime (it was recycled):

-rw------- 1 akukushkin akukushkin 16777216 Aug 22 07:22 0000000500000AB300000057
-rw------- 1 akukushkin akukushkin 16777216 Aug 22 07:01 0000000500000AB300000058

Minimum recovery ending location is AB3/4A1B3118, but at the same time I managed to find pages from 0000000500000AB300000053 on disk (at least in the index files).  That can only mean that bgwriter was flushing dirty pages but pg_control wasn't properly updated, and it happened not during recovery after the hardware crash, but while postgres was running before the hardware crash.

The only possible way to recover such a standby is to cut off all possible connections and let it replay all the WAL files it managed to write to disk before the first crash.

Regards,
--
Alexander Kukushkin
On Thu, Aug 30, 2018 at 08:31:36PM +0200, Alexander Kukushkin wrote:
> 2018-08-30 19:34 GMT+02:00 Michael Paquier <michael@paquier.xyz>:
>> I have been struggling for a couple of hours to get a deterministic test
>> case out of my pocket, and I did not get one as you would need to get
>> the bgwriter to flush a page before crash recovery finishes, we could do
>
> In my case the active standby server has crashed, it wasn't in the
> crash recovery mode.

That's what I meant: a standby crashed and then restarted, doing crash recovery before moving on with archive recovery once it was done with all its local WAL.

> Minimum recovery ending location is AB3/4A1B3118, but at the same time
> I managed to find pages from 0000000500000AB300000053 on disk (at
> least in the index files).  That could only mean that bgwriter was
> flushing dirty pages, but pg_control wasn't properly updated and it
> happened not during recovery after hardware crash, but while the
> postgres was running before the hardware crash.

Exactly, that would explain the incorrect reference.

> The only possible way to recover such standby - cut off all possible
> connections and let it replay all WAL files it managed to write to
> disk before the first crash.

Yeah...  I am going to apply the patch after another lookup; that will fix the problem moving forward.  Thanks for checking, by the way.
--
Michael
Attachments
At Thu, 30 Aug 2018 11:57:05 -0700, Michael Paquier <michael@paquier.xyz> wrote in <20180830185705.GF15446@paquier.xyz>
> On Thu, Aug 30, 2018 at 08:31:36PM +0200, Alexander Kukushkin wrote:
> > 2018-08-30 19:34 GMT+02:00 Michael Paquier <michael@paquier.xyz>:
> >> I have been struggling for a couple of hours to get a deterministic test
> >> case out of my pocket, and I did not get one as you would need to get
> >> the bgwriter to flush a page before crash recovery finishes, we could do
> >
> > In my case the active standby server has crashed, it wasn't in the
> > crash recovery mode.
>
> That's what I meant, a standby crashed and then restarted, doing crash
> recovery before moving on with archive recovery once it was done with
> all its local WAL.
>
> > Minimum recovery ending location is AB3/4A1B3118, but at the same time
> > I managed to find pages from 0000000500000AB300000053 on disk (at
> > least in the index files).  That could only mean that bgwriter was
> > flushing dirty pages, but pg_control wasn't properly updated and it
> > happened not during recovery after hardware crash, but while the
> > postgres was running before the hardware crash.
>
> Exactly, that would explain the incorrect reference.
>
> > The only possible way to recover such standby - cut off all possible
> > connections and let it replay all WAL files it managed to write to
> > disk before the first crash.
>
> Yeah... I am going to apply the patch after another lookup, that will
> fix the problem moving forward.  Thanks for checking by the way.

Please wait a bit.. I have a concern about this.

regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Fri, Aug 31, 2018 at 09:48:46AM +0900, Kyotaro HORIGUCHI wrote:
> Please wait a bit.. I have a concern about this.

Sure, please feel free.
--
Michael
Attachments
At Thu, 30 Aug 2018 18:48:55 -0700, Michael Paquier <michael@paquier.xyz> wrote in <20180831014855.GJ15446@paquier.xyz>
> On Fri, Aug 31, 2018 at 09:48:46AM +0900, Kyotaro HORIGUCHI wrote:
> > Please wait a bit.. I have a concern about this.
>
> Sure, please feel free.

Thanks.  I looked through the patch and related code and have a concern.  The patch inhibits turning off updateMinRecoveryPoint on any process other than the startup process running crash recovery, except at the end of XLogNeedsFlush:

> 	/*
> 	 * Check minRecoveryPoint for any other process than the startup
> 	 * process doing crash recovery, which should not update the control
> 	 * file value if crash recovery is still running.
> 	 */
> 	if (XLogRecPtrIsInvalid(minRecoveryPoint))
> 		updateMinRecoveryPoint = false;

Even if any process other than the startup process calls the function during crash recovery, it has no means to turn updateMinRecoveryPoint on again.  Actually, the only process that calls the function during crash recovery is the startup process; bgwriter and checkpointer don't, and hot-standby backends come in after crash recovery ends.  After all, we just won't have an invalid minRecoveryPoint value there, and updateMinRecoveryPoint must be true.

Other than that I didn't find a problem.  Please find the attached patch.  I also have not come up with a way to check this case automatically..

regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 493f1db7b9..74692a128d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2723,9 +2723,13 @@ UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force)
 	 * i.e., we're doing crash recovery.  We never modify the control file's
 	 * value in that case, so we can short-circuit future checks here too.  The
 	 * local values of minRecoveryPoint and minRecoveryPointTLI should not be
-	 * updated until crash recovery finishes.
+	 * updated until crash recovery finishes.  We only do this for the startup
+	 * process as it should not update its own reference of minRecoveryPoint
+	 * until it has finished crash recovery to make sure that all WAL
+	 * available is replayed in this case.  This also saves from extra locks
+	 * taken on the control file from the startup process.
 	 */
-	if (XLogRecPtrIsInvalid(minRecoveryPoint))
+	if (XLogRecPtrIsInvalid(minRecoveryPoint) && InRecovery)
 	{
 		updateMinRecoveryPoint = false;
 		return;
@@ -2737,7 +2741,9 @@ UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force)
 	minRecoveryPoint = ControlFile->minRecoveryPoint;
 	minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;
 
-	if (force || minRecoveryPoint < lsn)
+	if (XLogRecPtrIsInvalid(minRecoveryPoint))
+		updateMinRecoveryPoint = false;
+	else if (force || minRecoveryPoint < lsn)
 	{
 		XLogRecPtr	newMinRecoveryPoint;
 		TimeLineID	newMinRecoveryPointTLI;
@@ -3124,12 +3130,15 @@ XLogNeedsFlush(XLogRecPtr record)
 	if (RecoveryInProgress())
 	{
 		/*
-		 * An invalid minRecoveryPoint means that we need to recover all the
-		 * WAL, i.e., we're doing crash recovery.  We never modify the control
-		 * file's value in that case, so we can short-circuit future checks
-		 * here too.
+		 * An invalid minRecoveryPoint on the startup process means that we
+		 * need to recover all the WAL, i.e., we're doing crash recovery.  We
+		 * never modify the control file's value in that case, so we can
+		 * short-circuit future checks here too.  This triggers a quick exit
+		 * path for the startup process, which cannot update its local copy of
+		 * minRecoveryPoint as long as it has not replayed all WAL available
+		 * when doing crash recovery.
 		 */
-		if (XLogRecPtrIsInvalid(minRecoveryPoint))
+		if (XLogRecPtrIsInvalid(minRecoveryPoint) && InRecovery)
 			updateMinRecoveryPoint = false;
 
 		/* Quick exit if already known to be updated or cannot be updated */
@@ -3146,6 +3155,12 @@ XLogNeedsFlush(XLogRecPtr record)
 	minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;
 	LWLockRelease(ControlFileLock);
 
+	/*
+	 * No other process than the startup doesn't reach here before crash
+	 * recovery ends. So minRecoveryPoint must have a valid value here.
+	 */
+	Assert(minRecoveryPoint != InvalidXLogRecPtr);
+
 	/* check again */
 	return record > minRecoveryPoint;
 }
On Fri, Aug 31, 2018 at 02:52:06PM +0900, Kyotaro HORIGUCHI wrote:
> The patch inhibits turning off updateMinRecoveryPoint on any process
> other than the startup process running crash recovery, except at the
> end of XLogNeedsFlush.

Yes, that's a matter of safety, as I put into the bucket any modules
which may use XLogFlush(). And that maps with the old code, so there is
no surprise.

> Even if any process other than the startup calls the function during
> crash recovery, it has no means to turn updateMinRecoveryPoint back
> on. Actually the only process that calls the function during crash
> recovery is the startup process; bgwriter and checkpointer don't, and
> hot-standby backends come in only after crash recovery ends. After
> all, we just won't see an invalid minRecoveryPoint value there, and
> updateMinRecoveryPoint must be true.

Yes, until a recovery point is reached only the startup process could
call that. Now I could imagine a background worker as well, spawned when
the postmaster starts, which calls those code paths...

> + /*
> +  * No process other than the startup reaches here before crash
> +  * recovery ends, so minRecoveryPoint must have a valid value here.
> +  */
> + Assert(minRecoveryPoint != InvalidXLogRecPtr);

... And that would invalidate your assertion here.
--
Michael
On Thu, Aug 30, 2018 at 11:23:54PM -0700, Michael Paquier wrote:
> Yes, that's a matter of safety, as I put into the bucket any modules
> which may use XLogFlush(). And that maps with the old code, so there
> is no surprise.

Okay, I have pushed my previous version as that's the safest approach.
If you have any ideas for improvements or clarifications, let's discuss
them on a different thread, and only for HEAD. Many thanks Alexander for
the whole work and Horiguchi-san for the review.
--
Michael