Обсуждение: removing global variable ThisTimeLineID
Hi, The global variable ThisTimeLineID is rather confusing. During recovery, in the startup process, when we're reading a WAL record, it is the timeline of the WAL record we are trying to read or have just read, except when we're trying to read the initial checkpoint record, when it's zero. In other processes, it's typically 0 throughout recovery, but sometimes it's not, because for example CreateRestartPoint temporarily sets it to the timeline from which we're replaying WAL, since there's no other way to get RemoveOldXlogFiles() to recycle files onto the right timeline, or PreallocXlogFiles() to preallocate onto the right timeline. Similarly, walreceiver and walsender find it convenient to make ThisTimeLineID the timeline from which WAL is being replayed or at least the one from which it was being replayed at last check. logicalfuncs.c and slotfuncs.c set the global variable sometimes too, apparently just for fun, as the value doesn't seem to get used for anything. In normal running, once the startup process exits, the next call to RecoveryInProgress() sets the value to the timeline into which new WAL can be inserted. Note also that all of this is different from XLogCtl->ThisTimeLineID, which is set at the end of recovery and thus, in the startup process, generally different from ThisTImeLineID, but in other processes, generally the same as ThisTimeLineID. At the risk of stating the obvious, using the same variable for different purposes at different times is not a great programming practice. Commits 2f5c4397c39dea49c5608ba583868e26d767fc32 and 902a2c280012557b85c7e0fce3f6f0e355cb2d69 show that the possibility of programmer error is not zero, even though neither of those issues are serious. Moreover, the global variable itself seems to do nothing but invite programming errors. The name itself is a disaster. What is "this" timeline ID? Well, uh, the right one, of course! We should be more clear about what we mean: the timeline into which we are inserting, the timeline from which we are replaying, the timeline from which we are performing logical decoding. I suspect that having a global variable here isn't even really saving us anything much, as a value that does not change can be read from shared memory without a lock. I would like to clean this up. Attached is a series of patches which try to do that. For ease of review, this is separated out into quite a few separate patches, but most likely we'd combine some of them for commit. Patches 0001 through 0004 eliminate all use of the global variable ThisTimeLineID outside of xlog.c, and patch 0005 makes it "static" so that it ceases to be visible outside of xlog.c. Patches 0006 to 0010 clean up uses of ThisTimeLineID itself, passing it around as a function parameter, or otherwise arranging to fetch the relevant timeline ID from someplace sensible rather than using the global variable as a scratchpad. Finally, patch 0011 deletes the global variable. I have not made a serious attempt to clear up all of the terminological confusion created by the term ThisTimeLineID, which also appears as part of some structs, so you'll still see that name in the source code after applying these patches, just not as the name of a global variable. I have, however, used names like insertTLI and replayTLI in many places changed by the patch, and I think that makes it significantly more clear which timeline is being discussed in each case. In some places I have endeavored to improve the comments. There is doubtless more that could be done here, but I think this is a fairly respectable start. Opinions welcome. Thanks, -- Robert Haas EDB: http://www.enterprisedb.com
Вложения
- 0003-walsender.c-Don-t-rely-on-the-global-variable-ThisTi.patch
- 0004-walreceiver.c-Don-t-rely-on-the-global-variable-This.patch
- 0002-Add-GetWALInsertionTimeLine-also-return-it-via-GetFl.patch
- 0001-Don-t-set-ThisTimeLineID-when-there-s-no-reason-to-d.patch
- 0006-xlog.c-Change-assorted-functions-to-take-an-explicit.patch
- 0005-Change-ThisTimeLineID-to-be-static-rather-than-exter.patch
- 0009-xlog.c-Make-xlog_redo-not-depend-on-ThisTimeLineID-g.patch
- 0008-xlog.c-Use-XLogCtl-ThisTimeLineID-in-various-places.patch
- 0010-xlog.c-Adjust-some-more-functions-to-pass-the-TLI-ar.patch
- 0007-xlog.c-Invent-openLogTLI.patch
- 0011-Demote-ThisTimeLineID-to-a-local-variable.patch
On Mon, Nov 01, 2021 at 03:16:27PM -0400, Robert Haas wrote: > At the risk of stating the obvious, using the same variable for > different purposes at different times is not a great programming > practice. Commits 2f5c4397c39dea49c5608ba583868e26d767fc32 and > 902a2c280012557b85c7e0fce3f6f0e355cb2d69 show that the possibility of > programmer error is not zero, even though neither of those issues are > serious. This can lead to more serious issues, see 595b9cba as recent example. I agree that getting rid of it in the long term is appealing. > Moreover, the global variable itself seems to do nothing but > invite programming errors. The name itself is a disaster. What is > "this" timeline ID? Well, uh, the right one, of course! We should be > more clear about what we mean: the timeline into which we are > inserting, the timeline from which we are replaying, the timeline from > which we are performing logical decoding. I suspect that having a > global variable here isn't even really saving us anything much, as a > value that does not change can be read from shared memory without a > lock. Some callbacks related to the WAL reader to read a patch have introduced a dependency to ThisTimelineID as a mean to be used for logical decoding on standbys. This is discussed a bit on this thread, for example: https://www.postgresql.org/message-id/ddc31aa9-8083-58b7-0b47-e371cd4c705b@oss.nttdata.com Tracking properly the TLI in logical decoding was mentioned by Andres multiple times on -hackers, and this has not been really done. Using ThisTimelineID for this purpose is something I've found confusing myself, FWIW, so this is a step in the good direction. > I would like to clean this up. Attached is a series of patches which > try to do that. For ease of review, this is separated out into quite a > few separate patches, but most likely we'd combine some of them for > commit. Patches 0001 through 0004 eliminate all use of the global > variable ThisTimeLineID outside of xlog.c, and patch 0005 makes it > "static" so that it ceases to be visible outside of xlog.c. Patches > 0006 to 0010 clean up uses of ThisTimeLineID itself, passing it around > as a function parameter, or otherwise arranging to fetch the relevant > timeline ID from someplace sensible rather than using the global > variable as a scratchpad. Finally, patch 0011 deletes the global > variable. > > I have not made a serious attempt to clear up all of the > terminological confusion created by the term ThisTimeLineID, which > also appears as part of some structs, so you'll still see that name in > the source code after applying these patches, just not as the name of > a global variable. I have, however, used names like insertTLI and > replayTLI in many places changed by the patch, and I think that makes > it significantly more clear which timeline is being discussed in each > case. In some places I have endeavored to improve the comments. There > is doubtless more that could be done here, but I think this is a > fairly respectable start. Thanks for splitting things this way. This helps a lot. 0001 looks fair. + /* + * If we're writing and flushing WAL, the time line can't be changing, + * so no lock is required. + */ + if (insertTLI) + *insertTLI = XLogCtl->ThisTimeLineID; In 0002, there is no downside in putting that in the spinlock section either, no? It seems to me that we should be able to call this one even in recovery, as flush LSNs exist while in recovery. +TimeLineID +GetWALInsertionTimeLine(void) +{ + Assert(XLogCtl->SharedRecoveryState == RECOVERY_STATE_DONE); Okay, that's cheaper. The series 0003-0006 seemed rather fine, at quick glance. In 0007, XLogFileClose() should reset openLogTLI. The same can be said in BootStrapXLOG() before creating the control file. It may be cleaner here to introduce an InvalidTimelineId, as discussed a couple of days ago. Some comments at the top of recoveryTargetTLIRequested need a refresh with their references to ThisTimelineID. The handling of XLogCtl->ThisTimeLineID still seems a bit blurry when it comes to timeline switches because we don't update it while in recovery when replaying a end-of-recovery record. This could at least be documented better. -- Michael
Вложения
On Mon, Nov 1, 2021 at 11:33 PM Michael Paquier <michael@paquier.xyz> wrote: > + /* > + * If we're writing and flushing WAL, the time line can't be changing, > + * so no lock is required. > + */ > + if (insertTLI) > + *insertTLI = XLogCtl->ThisTimeLineID; > In 0002, there is no downside in putting that in the spinlock section > either, no? It seems to me that we should be able to call this one > even in recovery, as flush LSNs exist while in recovery. I don't think it matters very much from a performance point of view, although putting a branch inside of a spinlock-protected section may be undesirable. My bigger issues with this kind of idea is that we don't want to use spinlocks as a "security blanket" (as in Linus from Peanuts). Every time we use a spinlock, or any other kind of locking mechanism, we should have a clear idea what we're protecting ourselves against. I'm quite suspicious that there are a number of places where we're taking spinlocks in xlog.c just to feel virtuous, and not because it does anything useful, and I don't particularly want to increase the number of such places. Also, XLogCtl->ThisTimeLineID isn't set until the end of recovery, so calling this function during recovery would be a mistake. There seem to be a number of interface functions in xlog.c that should only be called when not in recovery, and neither their names nor their comments make that clear. I wasn't bold enough to go label all the ones I think fall into that category, but maybe we should. Honestly, the naming of things in this file is annoyingly bad in general. My favorite example, found during this project, is: #define ConvertToXSegs(x, segsize) XLogMBVarToSegs((x), (segsize)) It's not good enough to have one name that's difficult to understand or remember or capitalize properly -- let's have TWO -- for the same thing! > In 0007, XLogFileClose() should reset openLogTLI. The same can be > said in BootStrapXLOG() before creating the control file. It may be > cleaner here to introduce an InvalidTimelineId, as discussed a couple > of days ago. I thought about that, but I think it's pointless. I think we can just say that if openLogFile == -1, openLogSegNo and openLogTLI are meaningless. I don't think we're currently resetting openLogSegNo to an invalid value either, so I don't think openLogTLI should be treated differently. I'm not opposed to introducing InvalidTimeLineID on principle, and I looked for a way of doing it in this patch set, but it didn't seem all that valuable, and I feel that someone who cares about it can do it as a separate effort after I commit these. Honestly, I think these patches greatly reduce the need to be afraid of leaving the TLI unset, because they remove the crutch of hoping that ThisTimeLineID is initialized to the proper value. You actually have to think about which TLI you are going to pass. Now I'm very sure further improvement is possible, but I think this patch set is complex enough already, and I'd like to get it committed before contemplating any other work in this area. > Some comments at the top of recoveryTargetTLIRequested need a refresh > with their references to ThisTimelineID. After thinking this over, I think we should just delete the entire first paragraph, so that the comments start like this: /* * recoveryTargetTimeLineGoal: what the user requested, if any * * recoveryTargetTLIRequested: numeric value of requested timeline, if constant I don't see that we'd be losing anything that way. When you look at that paragraph now, the first sentence is just confusing the issue, because we actually don't care about XLogCtl->ThisTimeLineID during recovery, because it's not set. We do care about plain old ThisTimeLineID, but that's going to be demoted to a local variable by these patches, and there just doesn't seem to be any reason to discuss it here. The second sentence is wrong already, because the rmgr routines do not rely on ThisTimeLineID. And the third sentence is also wrong and/or confusing, because not all of the variables that follow are TLIs -- only 3 of 5 are. I don't actually think there's any useful introduction we need to provide here; we just need to tell people what the variables we're about to declare do. > The handling of XLogCtl->ThisTimeLineID still seems a bit blurry when > it comes to timeline switches because we don't update it while in > recovery when replaying a end-of-recovery record. This could at least > be documented better. We don't update it during recovery at all. There's exactly one assignment statement for that variable and it's here: /* Save the selected TimeLineID in shared memory, too */ XLogCtl->ThisTimeLineID = ThisTimeLineID; XLogCtl->PrevTimeLineID = PrevTimeLineID; That's after the main redo loop completes. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Nov 2, 2021 at 12:46 AM Robert Haas <robertmhaas@gmail.com> wrote: > > [...] > I would like to clean this up. Attached is a series of patches which > try to do that. For ease of review, this is separated out into quite a > few separate patches, but most likely we'd combine some of them for > commit. Patches 0001 through 0004 eliminate all use of the global > variable ThisTimeLineID outside of xlog.c, and patch 0005 makes it > "static" so that it ceases to be visible outside of xlog.c. Patches > 0006 to 0010 clean up uses of ThisTimeLineID itself, passing it around > as a function parameter, or otherwise arranging to fetch the relevant > timeline ID from someplace sensible rather than using the global > variable as a scratchpad. Finally, patch 0011 deletes the global > variable. > > I have not made a serious attempt to clear up all of the > terminological confusion created by the term ThisTimeLineID, which > also appears as part of some structs, so you'll still see that name in > the source code after applying these patches, just not as the name of > a global variable. I have, however, used names like insertTLI and > replayTLI in many places changed by the patch, and I think that makes > it significantly more clear which timeline is being discussed in each > case. In some places I have endeavored to improve the comments. There > is doubtless more that could be done here, but I think this is a > fairly respectable start. > > Opinions welcome. > I had a plan to look at these patches closely, but unfortunately, didn't get enough time; and might not be able to spend time in the rest of this week. Here are a few thoughts that I had in the initial go-through, but that may or may not sound very interesting: 0002: -GetFlushRecPtr(void) +GetFlushRecPtr(TimeLineID *insertTLI) Should we rename this argument to more generic as "tli", like GetStandbyFlushRecPtr, since the caller in 003 patch trying to fetch a TLI that means different for them, e.g. currTLI, FlushTLI, etc. 0004: static int -XLogFileInitInternal(XLogSegNo logsegno, bool *added, char *path) +XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli, + bool *added, char *path) int -XLogFileInit(XLogSegNo logsegno) +XLogFileInit(XLogSegNo logsegno, TimeLineID logtli) { How about logsegtli instead, to be inlined with logsegno ? 0007: static XLogSegNo openLogSegNo = 0; +static TimeLineID openLogTLI = 0; Similarly, openLogSegTLI ? 0008: + /* + * Given that we're not in recovery, ThisTimeLineID is set and can't + * change, so we can read it without a lock. + */ + insertTLI = XLogCtl->ThisTimeLineID; Can't GetWALInsertionTimeLine() called instead? I guess the reason is to avoid the unnecessary overhead involved in the function call. (Same at a few other places.) Regards, Amul
On Wed, Nov 3, 2021 at 9:12 AM Amul Sul <sulamul@gmail.com> wrote: > 0002: > -GetFlushRecPtr(void) > +GetFlushRecPtr(TimeLineID *insertTLI) > > Should we rename this argument to more generic as "tli", like > GetStandbyFlushRecPtr, since the caller in 003 patch trying to fetch a > TLI that means different for them, e.g. currTLI, FlushTLI, etc. It's possible that more consistency would be better here, but I don't think making this name more generic is going in the right direction. If somebody gets the TLI using this function, it's the timeline into which a system not in recovery is inserting WAL, which is the same timeline to which WAL is being flushed. I think it's important for this name to try to make that clear. It doesn't really matter if someone treats the insertTLI as the writeTLI or the flushTLI since on a system in production they're all the same - but they must not confuse it with, say, the replayTLI. > How about logsegtli instead, to be inlined with logsegno ? > Similarly, openLogSegTLI ? Hmm, well I guess it depends on how you think the words group. If you logsegno means "the number of the log segment" then it would make sense to have logsegli to mean "the tli of the log segment." But I think of logsegno as meaning "the segment number of the log file," so it makes more sense to have logtli to mean "the TLI of the log file". I think it's definitely not a "log segment file" - just a "log file". > Can't GetWALInsertionTimeLine() called instead? I guess the reason is > to avoid the unnecessary overhead involved in the function call. (Same > at a few other places.) That, plus to avoid failing the assertion in that function. As we've discussed on the ALTER SYSTEM READ ONLY thread, xlog.c itself inserts some WAL before recovery is officially over. -- Robert Haas EDB: http://www.enterprisedb.com
I looked at these in a very quick skim. I agree that this is a good step in a good direction. I 'git rebase -x'd this series in order to compile and run the tests on each patch individually. There are no compiler warnings anywhere and no test failures. -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/ "La victoria es para quien se atreve a estar solo"
On Fri, Nov 5, 2021 at 9:49 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > I looked at these in a very quick skim. I agree that this is a > good step in a good direction. Cool. I have now committed these. I grouped them into just 3 commits rather than having 11 separate commits as I did in my earlier posting, but the content is the same. > I 'git rebase -x'd this series in order to compile and run the tests on > each patch individually. There are no compiler warnings anywhere and no > test failures. Thanks for checking. It's hard to tell whether the stuff that works on my machine will work everywhere, unless someone tests it. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Nov 02, 2021 at 08:45:57AM -0400, Robert Haas wrote: > Also, XLogCtl->ThisTimeLineID isn't set until the end of recovery, so > calling this function during recovery would be a mistake. There seem > to be a number of interface functions in xlog.c that should only be > called when not in recovery, and neither their names nor their > comments make that clear. I wasn't bold enough to go label all the > ones I think fall into that category, but maybe we should. I got to wonder whether it would be better to add in GetFlushRecPtr() the same assertion as GetWALInsertionTimeLine(). All the in-core callers of this routine already assume that, and it would be buggy if one passes down insertTLI to get the current TLI. > I thought about that, but I think it's pointless. I think we can just > say that if openLogFile == -1, openLogSegNo and openLogTLI are > meaningless. I don't think we're currently resetting openLogSegNo to > an invalid value either, so I don't think openLogTLI should be treated > differently. Wouldn't it be better to at least document that as a comment rather than letting the reader guess it, then? I agree that this is a minor point, though. > I'm not opposed to introducing InvalidTimeLineID on > principle, and I looked for a way of doing it in this patch set, but > it didn't seem all that valuable, and I feel that someone who cares > about it can do it as a separate effort after I commit these. No issues from me. I have bumped into a case just at the end of last week where I wanted a better way to tell if a TLI is invalid rather than leaving things to 0, FWIW. >> The handling of XLogCtl->ThisTimeLineID still seems a bit blurry when >> it comes to timeline switches because we don't update it while in >> recovery when replaying a end-of-recovery record. This could at least >> be documented better. > > We don't update it during recovery at all. There's exactly one > assignment statement for that variable and it's here: Indeed. It looks like my reading was sloppy here. -- Michael
Вложения
On Sun, Nov 7, 2021 at 11:24 PM Michael Paquier <michael@paquier.xyz> wrote: > I got to wonder whether it would be better to add in GetFlushRecPtr() > the same assertion as GetWALInsertionTimeLine(). All the in-core > callers of this routine already assume that, and it would be buggy if > one passes down insertTLI to get the current TLI. Perhaps. That's related to the point I made before, that it might be a good idea to be more clear about which of these functions are intended to be used in recovery and which ones are intended to be used in normal running. I don't rule out the possibility of doing some more research here and tightening that up, but I don't want to go start tweaking things unless I'm sure I know what I'm talking about. This stuff is so confusing that I don't want to take any risk of making changes that turn out to be incorrect. > > I thought about that, but I think it's pointless. I think we can just > > say that if openLogFile == -1, openLogSegNo and openLogTLI are > > meaningless. I don't think we're currently resetting openLogSegNo to > > an invalid value either, so I don't think openLogTLI should be treated > > differently. > > Wouldn't it be better to at least document that as a comment rather > than letting the reader guess it, then? I agree that this is a minor > point, though. I don't think it's very confusing, but if you want to improve the comments, go for it! > >> The handling of XLogCtl->ThisTimeLineID still seems a bit blurry when > >> it comes to timeline switches because we don't update it while in > >> recovery when replaying a end-of-recovery record. This could at least > >> be documented better. > > > > We don't update it during recovery at all. There's exactly one > > assignment statement for that variable and it's here: > > Indeed. It looks like my reading was sloppy here. I can hardly blame you. There are comments all over the place claiming that this thing and that thing are the same when they're only sometimes the same. Or, as in this case, things that are named the same when they're actually different. I'm actually tempted to make a pass over the comments related to this topic and just try to make things more comprehensible. It's super-confusing as things stand. -- Robert Haas EDB: http://www.enterprisedb.com
On Mon, Nov 8, 2021 at 8:27 AM Robert Haas <robertmhaas@gmail.com> wrote: > Perhaps. That's related to the point I made before, that it might be a > good idea to be more clear about which of these functions are intended > to be used in recovery and which ones are intended to be used in > normal running. I don't rule out the possibility of doing some more > research here and tightening that up, but I don't want to go start > tweaking things unless I'm sure I know what I'm talking about. This > stuff is so confusing that I don't want to take any risk of making > changes that turn out to be incorrect. Well I went through and it seems to be OK: all the existing callers of that function first verify that we're not in recovery. The patch to make logical decoding work in standby mode might change that, but as of now it's so. So in the attached patch, I have added the assertion and comments that you have proposed there. I also removed the misleading comment we discussed before. In addition to that, I renamed the local variable ThisTimeLineID to replayTLI, and I also renamed XLogCtl->ThisTimeLineID to InsertTimeLineID, which I think should make things clearer: imagine if things that were used for different purposes had different names! Even with this patch, the name ThisTimeLineID is still used for multiple purposes. It remains part of the CheckPoint struct, and also part of the xl_end_of_recovery struct. But in both of those cases, the name ThisTimeLineID actually makes sense, because what we're thinking about is whether there's a timeline change, so we have ThisTimeLineID and PrevTimeLineID and it seems fairly clear what that's supposed to mean. Thoughts? -- Robert Haas EDB: http://www.enterprisedb.com
Вложения
On Mon, Nov 08, 2021 at 12:49:52PM -0500, Robert Haas wrote: > Even with this patch, the name ThisTimeLineID is still used for > multiple purposes. It remains part of the CheckPoint struct, and also > part of the xl_end_of_recovery struct. But in both of those cases, the > name ThisTimeLineID actually makes sense, because what we're thinking > about is whether there's a timeline change, so we have ThisTimeLineID > and PrevTimeLineID and it seems fairly clear what that's supposed to > mean. I got to wonder if it would be better to bite the bullet here for these two looking at your patch, but I am fine to keep things as they are, as well. > Thoughts? I think that this patch is an improvement. @@ -6686,8 +6682,8 @@ StartupXLOG(void) - TimeLineID ThisTimeLineID, - PrevTimeLineID; + TimeLineID replayTLI, + newTLI; One problem with newTLI is that this conflicts with the declaration around 7663 in xlog.c, where we check after a TLI switch when replaying such a record. Perhaps this could be named newInsertTLI, for example. -- Michael
Вложения
On Mon, Nov 8, 2021 at 10:31 PM Michael Paquier <michael@paquier.xyz> wrote: > I think that this patch is an improvement. Cool. > @@ -6686,8 +6682,8 @@ StartupXLOG(void) > - TimeLineID ThisTimeLineID, > - PrevTimeLineID; > + TimeLineID replayTLI, > + newTLI; > One problem with newTLI is that this conflicts with the declaration > around 7663 in xlog.c, where we check after a TLI switch when > replaying such a record. Perhaps this could be named newInsertTLI, > for example. That's a good point. However, since I think newTLI is already in use in some of the functions StartupXLOG() calls, and since I think it would be good to use the same name in StartupXLOG() as in the functions that it calls, what I'd prefer to do is rename the newTLI variable in the inner scope to newReplayTLI, as in the attached v2. -- Robert Haas EDB: http://www.enterprisedb.com
Вложения
On Tue, Nov 09, 2021 at 02:15:51PM -0500, Robert Haas wrote: > That's a good point. However, since I think newTLI is already in use > in some of the functions StartupXLOG() calls, and since I think it > would be good to use the same name in StartupXLOG() as in the > functions that it calls, what I'd prefer to do is rename the newTLI > variable in the inner scope to newReplayTLI, as in the attached v2. WFM. Thanks. -- Michael
Вложения
On Tue, Nov 9, 2021 at 7:41 PM Michael Paquier <michael@paquier.xyz> wrote: > On Tue, Nov 09, 2021 at 02:15:51PM -0500, Robert Haas wrote: > > That's a good point. However, since I think newTLI is already in use > > in some of the functions StartupXLOG() calls, and since I think it > > would be good to use the same name in StartupXLOG() as in the > > functions that it calls, what I'd prefer to do is rename the newTLI > > variable in the inner scope to newReplayTLI, as in the attached v2. > > WFM. Thanks. Cool. Committed that way. -- Robert Haas EDB: http://www.enterprisedb.com
Hi, On 11/8/21 6:49 PM, Robert Haas wrote: > On Mon, Nov 8, 2021 at 8:27 AM Robert Haas <robertmhaas@gmail.com> wrote: > Well I went through and it seems to be OK: all the existing callers of > that function first verify that we're not in recovery. The patch to > make logical decoding work in standby mode might change that, Indeed, I think the logical decoding on standby patch just needs to change the Assert in GetWALInsertionTimeLine() to check SharedRecoveryState is RECOVERY_STATE_DONE or RECOVERY_STATE_ARCHIVE. Would that make sense from your point of view to add the check on RECOVERY_STATE_ARCHIVE or do you think it would need to be more "sophisticated"? Thanks Bertrand
On Tue, Nov 23, 2021 at 4:36 AM Drouvot, Bertrand <bdrouvot@amazon.com> wrote: > Indeed, I think the logical decoding on standby patch just needs to > change the Assert in GetWALInsertionTimeLine() to check > SharedRecoveryState is RECOVERY_STATE_DONE or RECOVERY_STATE_ARCHIVE. > > Would that make sense from your point of view to add the check on > RECOVERY_STATE_ARCHIVE or do you think it would need to be more > "sophisticated"? Unless I'm confused, that would just be incorrect. GetWALInsertionTimeLine() just returns InsertTimeLineID, which won't be initialized during archive recovery. I discussed this point a bit on some other thread with Andres. It's possible that all that you need to do here is determine the replayTLI using e.g. GetXLogReplayRecPtr. But I don't really see why that should be correct: read_local_xlog_page() has some kind of wait-and-retry logic and I'm not clear why we don't need the same thing here. After all, if we're on the primary and we determine that sendTimeLineHistoric = false and sendTimeLine = whatever, neither of those values can become incorrect afterward. But on a standby that can certainly happen, if an upstream server is promoted. Note that the comment in read_local_xlog_page() says this: * If that happens after our caller determined the TLI but before * we actually read the xlog page, we might still try to read from the * old (now renamed) segment and fail. There's not much we can do * about this, but it can only happen when we're a leaf of a cascading * standby whose primary gets promoted while we're decoding, so a * one-off ERROR isn't too bad. That seems to imply that the retry loop here is designed, at least in part, to protect against exactly this kind of scenario. -- Robert Haas EDB: http://www.enterprisedb.com