Re: Race condition in recovery?
From: Robert Haas
Subject: Re: Race condition in recovery?
Date:
Msg-id: CA+TgmoY8=PtmyDWLP3gG59pifbtp3gwKfKYz1FpfqmKF25PCqQ@mail.gmail.com
In reply to: Re: Race condition in recovery? (Dilip Kumar <dilipbalaut@gmail.com>)
Responses: Re: Race condition in recovery? (Dilip Kumar <dilipbalaut@gmail.com>)
List: pgsql-hackers
On Fri, May 14, 2021 at 12:59 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I am not sure that have you noticed the commit id which changed the
> definition of expectedTLEs, Heikki has committed that change so adding
> him in the list to know his opinion.

I did notice, but keep in mind that this was more than 8 years ago.
Even if Heikki is reading this thread, he may not remember why he
changed 1 line of code one way rather than another in 2013. I mean if
he does that's great, but it's asking a lot.

> =====
> ee994272ca50f70b53074f0febaec97e28f83c4e
> Author: Heikki Linnakangas <heikki.linnakangas@iki.fi> 2013-01-03 14:11:58
> Committer: Heikki Linnakangas <heikki.linnakangas@iki.fi> 2013-01-03 14:11:58
>
> Delay reading timeline history file until it's fetched from master.
> .....
> Without the timeline history file, recovering that file
> will fail as the older timeline ID is not recognized to be an ancestor of
> the target timeline. If you try to recover from such a backup, using only
> streaming replication to fetch the WAL, this patch is required for that to
> work.
> =====
>
> Part of this commit message says that it will not identify the
> recoveryTargetTLI as the ancestor of the checkpoint timeline (without
> history file). I did not understand what it is trying to say. Does
> it is trying to say that even though the recoveryTargetTLI is the
> ancestor of the checkpoint timeline but we can not track that because
> we don't have a history file? So to handle this problem change the
> definition of expectedTLEs to directly point to the checkpoint
> timeline?
>
> Because before this commit, we were directly initializing expectedTLEs
> with the history file of recoveryTargetTLI, we were not even waiting
> for reading the checkpoint, but under this commit, it is changed.

Well, I think that is talking about what the commit did in general,
not specifically the one line of code that we think may be incorrect.
As I understand it, the general issue here was that if
XLogFileReadAnyTLI() was called before expectedTLEs got set, then
prior to this commit it would have to fail, because the foreach() loop
in that function would be iterating over an empty list. Heikki tried
to make it not fail in that case, by setting tles =
readTimeLineHistory(recoveryTargetTLI), so that the foreach loop
*wouldn't* get an empty list.

Thinking about this a bit more, I think the idea behind the logic this
commit added to XLogFileReadAnyTLI() is that
readTimeLineHistory(recoveryTargetTLI) may or may not produce the
correct answer. If the timeline history file exists, it will contain
all the information that we need and will return a complete list of
TLEs. But if the file does not exist yet, then it will return a
1-entry list saying that the TLI in question has no parents.

If readTimeLineHistory() actually reads the file, then it's safe to
save the return value in expectedTLEs, but if it doesn't, then it may
or may not be safe. If XLogFileReadAnyTLI calls XLogFileRead and it
works, then the WAL segment we need exists on our target timeline, and
we don't actually need the timeline history for anything, because we
can just directly begin replay from the target timeline. But if
XLogFileRead fails with the 1-entry dummy list, then we need the
timeline history and don't have it yet, so we have to retry later,
when the history file will hopefully be present; at that point
readTimeLineHistory will return a different and better answer, and
hopefully it will all work.

I think this is what the commit message is talking about when it says
that "Without the timeline history file, recovering that file will
fail as the older timeline ID is not recognized to be an ancestor of
the target timeline." Without the timeline history file, we can't know
whether some other timeline is an ancestor or not.
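To make that retry logic concrete, here is a minimal, self-contained
sketch of the decision described above. The names and types are
simplified stand-ins (the real logic lives in PostgreSQL's xlog.c and
differs in detail); this is a model of the behavior, not the actual
server code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for PostgreSQL's timeline-history entry list. */
typedef struct TLE
{
    unsigned    tli;
    struct TLE *parent;     /* NULL = no known ancestors */
} TLE;

static TLE parent_tle = { 1, NULL };
static TLE full_tle   = { 2, &parent_tle }; /* real history: 2 <- 1 */
static TLE dummy_tle  = { 2, NULL };        /* 1-entry dummy list */

/* Global that must not end up holding the dummy list spuriously. */
static TLE *expectedTLEs = NULL;

/*
 * Model of readTimeLineHistory(): with the history file we get the
 * full ancestry; without it, a 1-entry list claiming no parents.
 */
static TLE *
read_timeline_history(bool history_file_exists)
{
    return history_file_exists ? &full_tle : &dummy_tle;
}

/*
 * Model of XLogFileReadAnyTLI(): only store the list into
 * expectedTLEs once a segment read succeeds -- either the history
 * was complete, or replay can start directly on the target timeline,
 * so the (possibly dummy) history doesn't matter.  Otherwise fail
 * and let the caller retry after the history file has arrived.
 */
static bool
xlog_file_read_any_tli(bool history_file_exists, bool segment_read_ok)
{
    TLE *tles = read_timeline_history(history_file_exists);

    if (segment_read_ok)
    {
        expectedTLEs = tles;
        return true;
    }
    return false;   /* expectedTLEs stays unset; retry later */
}
```

On this reading, the point of the commit is exactly that the dummy
list can only "stick" in expectedTLEs when the segment read succeeds,
i.e. when no timeline switch is needed.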
But the specific way that manifests is that XLogFileReadAnyTLI()
returns a 1-entry dummy list instead of the real contents of the
timeline history file. This commit doesn't prevent that from
happening, but it does prevent the 1-entry dummy list from getting
stored in the global variable expectedTLEs, except in the case where
no timeline switch is occurring and the lack of history therefore
doesn't matter. Without this commit, if the call to
readTimeLineHistory(recoveryTargetTLI) happens at a time when the
timeline history file is not yet available, the 1-entry dummy list
ends up in the global variable, and there's no way for it to ever be
replaced with a real history even if the timeline history file shows
up in the archive later.

As I see it, the question on the table here is whether there's any
justification for the fact that when the second switch in
WaitForWALToBecomeAvailable takes the
XLOG_FROM_ARCHIVE/XLOG_FROM_PG_WAL branch, it calls
XLogFileReadAnyTLI, which tries to read the history of
recoveryTargetTLI, while when that same switch takes the
XLOG_FROM_STREAM branch, it tries to read the history of receiveTLI.
I tend to think it doesn't make sense. On general principle, archiving
and streaming are supposed to work the same way, so the idea that they
are getting the timeline from different places is inherently suspect.
But also, and more specifically, AFAICS receiveTLI always has to be
the same TLI that we requested from the server, so we're always
looking up our own current TLI here rather than the target TLI, which
seems wrong to me, at least as of this moment. :-)

That having been said, I still don't quite understand the conditions
required to tickle this problem. I spent a long time poking at it
today. It seems to me that it ought somehow to be possible to recreate
the scenario without trying to reuse the old master as a standby, and
also without even needing a WAL archive, but I couldn't figure out how
to do it.
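The asymmetry between the two branches can be reduced to a toy model.
The TLI values here are invented for illustration, and the function is
a stand-in; the real switch is in WaitForWALToBecomeAvailable() in
xlog.c:

```c
#include <assert.h>

typedef enum
{
    XLOG_FROM_ARCHIVE,
    XLOG_FROM_PG_WAL,
    XLOG_FROM_STREAM
} XLogSource;

/* Invented example: we want timeline 2, the walreceiver streamed 1. */
static unsigned recoveryTargetTLI = 2;
static unsigned receiveTLI = 1;

/*
 * Which timeline's history each branch of the switch consults.  The
 * archive/pg_wal branch goes through XLogFileReadAnyTLI(), which
 * reads the history of recoveryTargetTLI; the streaming branch
 * instead calls readTimeLineHistory(receiveTLI) -- the inconsistency
 * under discussion.
 */
static unsigned
history_tli_consulted(XLogSource source)
{
    switch (source)
    {
        case XLOG_FROM_ARCHIVE:
        case XLOG_FROM_PG_WAL:
            return recoveryTargetTLI;
        case XLOG_FROM_STREAM:
            return receiveTLI;
    }
    return 0;   /* unreachable */
}
```

If archiving and streaming really are supposed to behave identically,
both branches ought to return the same value here.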
I tried setting up a primary and a standby, and then making a backup
from the standby, promoting it, and then starting what would have been
a cascading standby from the backup. But that doesn't do it. The first
mistake I made was creating the standbys with something like
'pg_basebackup -R', but that's not good enough, because then they have
WAL, so I added '-Xnone'. But then I realized that when a base backup
ends, the primary writes an XLOG_SWITCH record, which means that when
the standby is promoted, the promotion is not in the same WAL segment
as the checkpoint from which the machine that would have been a
cascading standby is trying to start. I worked around that by setting
recovery_target='immediate' on the standby. With that change, I get a
WAL file on the new timeline - 2 in this case - that looks like this:

rmgr: XLOG len (rec/tot): 114/ 114, tx: 0, lsn: 0/19000060, prev 0/19000028, desc: CHECKPOINT_ONLINE redo 0/19000028; tli 1; prev tli 1; fpw true; xid 0:587; oid 16385; multi 1; offset 0; oldest xid 579 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 587; online
rmgr: XLOG len (rec/tot): 34/ 34, tx: 0, lsn: 0/190000D8, prev 0/19000060, desc: BACKUP_END 0/19000028
rmgr: XLOG len (rec/tot): 114/ 114, tx: 0, lsn: 0/19000100, prev 0/190000D8, desc: CHECKPOINT_SHUTDOWN redo 0/19000100; tli 2; prev tli 1; fpw true; xid 0:587; oid 16385; multi 1; offset 0; oldest xid 579 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown

That sure looks like the right thing to recreate the problem, because
the first checkpoint is from the backup, and the
woulda-been-a-cascading-standby should be starting there, and the
second checkpoint is in the same segment and shows a timeline switch.
But everything worked great:

2021-05-14 17:44:58.684 EDT [5697] DETAIL: End of WAL reached on timeline 1 at 0/19000100.
2021-05-14 17:44:58.728 EDT [5694] LOG: new target timeline is 2
2021-05-14 17:44:58.746 EDT [5694] LOG: redo starts at 0/19000028
2021-05-14 17:44:58.749 EDT [5694] LOG: consistent recovery state reached at 0/19000100

I don't understand why that works. It feels to me like it ought to run
smack into the same problem you saw, but it doesn't.

> I am referring to the below code which was deleted by this commit:
>
> ========
> @@ -5279,49 +5299,6 @@ StartupXLOG(void)
>   */
>  readRecoveryCommandFile();
>
> - /* Now we can determine the list of expected TLIs */
> - expectedTLEs = readTimeLineHistory(recoveryTargetTLI);
> -
> - /*
> -  * If the location of the checkpoint record is not on the expected
> -  * timeline in the history of the requested timeline, we cannot proceed:
> -  * the backup is not part of the history of the requested timeline.
> -  */
> - if (tliOfPointInHistory(ControlFile->checkPoint, expectedTLEs) !=
> -     ControlFile->checkPointCopy.ThisTimeLineID)
> - {
> =========

I don't think this code is really deleted. The tliOfPointInHistory
check was just moved later in the function. And expectedTLEs is still
supposed to be getting initialized, because just before the new
location of the tliOfPointInHistory check, Heikki added
Assert(expectedTLEs).

-- 
Robert Haas
EDB: http://www.enterprisedb.com
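For reference, the tliOfPointInHistory() check quoted above maps a WAL
position to the timeline it belongs to according to a history list.
The following toy model uses simplified types and the LSNs from the
pg_waldump output earlier in the thread; the real function is in
PostgreSQL's timeline.c and this is only a sketch of its behavior:

```c
#include <assert.h>

typedef unsigned long long XLogRecPtr;
#define InvalidXLogRecPtr 0ULL

/* Simplified version of PostgreSQL's TimeLineHistoryEntry. */
typedef struct
{
    unsigned    tli;
    XLogRecPtr  begin;  /* first LSN on this timeline (0 = start) */
    XLogRecPtr  end;    /* switchpoint; InvalidXLogRecPtr = current */
} TimeLineHistoryEntry;

/*
 * Simplified tliOfPointInHistory(): return the timeline a given WAL
 * position belongs to, scanning the history (newest entry first).
 * Returning 0 models the "not part of the history" error path that
 * the relocated sanity check in StartupXLOG() guards against.
 */
static unsigned
tli_of_point_in_history(XLogRecPtr ptr,
                        const TimeLineHistoryEntry *h, int n)
{
    for (int i = 0; i < n; i++)
    {
        if (h[i].begin <= ptr &&
            (h[i].end == InvalidXLogRecPtr || ptr < h[i].end))
            return h[i].tli;
    }
    return 0;
}
```

With a two-entry history where timeline 2 begins at 0/19000100, the
backup's checkpoint redo pointer 0/19000028 resolves to timeline 1 and
anything at or past the switchpoint resolves to timeline 2, which is
what the moved check compares against the checkpoint's own TLI.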