Обсуждение: Race condition in recovery?
While analyzing one of the customer issues, based on the log it appeared that there is a race condition in the recovery process. The summary of the issue is, that one of the standby is promoted as the new primary (Node2) and another standby (Node3) is restarted and set the primary_info and the restore_command so that it can stream/restore from Node2 (new primary). The problem is that during the promotion the timeline switch happened in the middle of the segment in Node2 and the Node3 is able to restore the newTLI.history file from the archive but the WAL file with the new TLI is not yet archived. Now, we will try to stream the wal file from the primary but if we are fetching the checkpoint that time we will not use the latest timeline instead we will try with the checkpoint timeline and walsender will send the WAL from the new timeline file because requested WAL recptr is before the TLI switch. And once that happened the expectedTLEs will be set based on the oldTLI.history file. Now, whenever we try to restore the required WAL file and recoveryTargetTimeLineGoal is set to the latest we again try to rescan the latest timeline (rescanLatestTimeLine) but the problem is recoveryTargetTLI was already set to the latest timeline. So now the problem is expectedTLEs is set to oldTLI and recoveryTargetTLI is set to newTLI and rescanLatestTimeLine will never update the expectedTLEs. Now, walsender will fail to stream new WAL using the old TLI and the archiver process will also fail because that file doesn't not exists anymore (converted to .partial). Basically, now we will never try with the newTLI. I have given the sequence of the events based on my analysis. Refer to the sequence of event ----------------------------------------- Node1 primary, Node2 standby1, Node3 standby2 1. Node2 got promoted to new primary, and node 2 picked new TL 13 in the middle of the segment. 2. Node3, restarted with new primary info of Node2 and restore command 3. Node3, found the newest TL13 in validateRecoveryParameters() Because the latest TL was requested in recovery.conf (history file restored from TL13) and set recoveryTargetTLI to 13 So point to note is recoveryTargetTLI is set to 13 but expectedTLEs is not yet set. 4. Node3, entered into the standby mode. 5. Node3, tries to read the checkpoint Record, on Node3 still the checkpoint TL (ControlFile->checkPointCopy.ThisTimeLineID) is 12. 6. Node3, tries to get the checkpoint record file using new TL13 from the archive which it should get ideally but it may not if the Node2 haven't yet archived it. 7. Node3, tries to stream from primary but using TL12 because ControlFile->checkPointCopy.ThisTimeLineID is 12. 8. Node3, get it because walsender of Node2 read it from TL13 and send it and Node2 write in the new WAL file but with TL12. WalSndSegmentOpen() { /*------- * When reading from a historic timeline, and there is a timeline switch * within this segment, read from the WAL segment belonging to the new * timeline. } 9. Node3, now set the expectedTLEs to 12 because that is what walreceiver has streamed the WAL using. 10. Node3, now recoveryTargetTLI is 13 and expectedTLEs is 12. So whenever it tries to find the latest TLE (rescanLatestTimeLine ) it finds it is 13 which is the same as recoveryTargetTLI so nothing to change but expectedTLEs is 12 using which it will try to restore/stream further WAL and fail every time. rescanLatestTimeLine(void) { .... newtarget = findNewestTimeLine(recoveryTargetTLI); if (newtarget == recoveryTargetTLI) { /* No new timelines found */ return false; } 11. So now the situation is that ideally in rescanLatestTimeLine() we should get the latest TLI but it assumes that it is already on the latest TLI because recoveryTargetTLI is on the latest TLI. Other points to be noted: - The actual issue happened on 9.6.11 but based on the code comparison it appeared that same can occur on the latest code as well. - After Node3 is shutdown wal from its pg_wal/ directory were removed so that it can follow the new primary. Based on the sequence of events It is quite clear that something is wrong in rescanLatestTimeLine, maybe after selecting the latest TLI we should compare it with the head of the expectedTLEs as well instead of just comparing it to the recoveryTargetTLI? Log from Node2: 2020-12-22 04:49:02 UTC [1019]: [9-1] user=; db=; app=; client=; SQLcode=00000 LOG: received promote request 2020-12-22 04:49:02 UTC [1019]: [10-1] user=; db=; app=; client=; SQLcode=00000 LOG: redo done at 0/F8000028 2020-12-22 04:49:02 UTC [1019]: [11-1] user=; db=; app=; client=; SQLcode=00000 LOG: last completed transaction was at log time 2020-12-22 04:48:01.833476+00 rsync: link_stat "/wal_archive/ins30wfm02dbs/0000000C00000000000000F8" failed: No such file or directory (2) rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1179) [sender=3.1.2] rsync: link_stat "/wal_archive/ins30wfm02dbs/0000000D.history" failed: No such file or directory (2) rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1179) [sender=3.1.2] 2020-12-22 04:49:02 UTC [1019]: [12-1] user=; db=; app=; client=; SQLcode=00000 LOG: selected new timeline ID: 13 2020-12-22 04:49:02 UTC [1019]: [13-1] user=; db=; app=; client=; SQLcode=00000 LOG: archive recovery complete Log From Node3 (with pointwise analysis): 1. Node3 restarted, restored 0000000D.history from archive and recoveryTargetTLI is set to 13 2020-12-22 04:49:40 UTC [2896]: [2-1] user=; db=; app=; client=; SQLcode=00000 LOG: database system is shut down 2020-12-22 04:49:40 UTC [2872]: [6-1] user=; db=; app=; client=; SQLcode=00000 LOG: database system is shut down 2020-12-22 04:49:41 UTC [9082]: [1-1] user=; db=; app=; client=; SQLcode=00000 LOG: database system was shut down in recovery at 2020-12-22 04:49:40 UTC 2020-12-22 04:49:41 UTC [9082]: [2-1] user=; db=; app=; client=; SQLcode=00000 LOG: creating missing WAL directory "pg_xlog/archive_status" 2020-12-22 04:49:41 UTC [9082]: [3-1] user=; db=; app=; client=; SQLcode=00000 LOG: restored log file "0000000D.history" from archive rsync: link_stat "/wal_archive/ins30wfm02dbs/0000000E.history" failed: No such file or directory (2) rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1179) [sender=3.1.2] 2020-12-22 04:49:41 UTC [9082]: [4-1] user=; db=; app=; client=; SQLcode=00000 LOG: entering standby mode 2020-12-22 04:49:41 UTC [9082]: [5-1] user=; db=; app=; client=; SQLcode=00000 LOG: restored log file "0000000D.history" from archive 2. Tries to get the WAL file with different timelines from the archive but did not get so expectedTLEs is not yet set rsync: link_stat "/wal_archive/ins30wfm02dbs/0000000D00000000000000F8" failed: No such file or directory (2) rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1179) [sender=3.1.2] rsync: link_stat "/wal_archive/ins30wfm02dbs/0000000C00000000000000F8" failed: No such file or directory (2) rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1179) [sender=3.1.2] rsync: link_stat "/wal_archive/ins30wfm02dbs/0000000B00000000000000F8" failed: No such file or directory (2) rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1179) [sender=3.1.2] rsync: link_stat "/wal_archive/ins30wfm02dbs/0000000100000000000000F8" failed: No such file or directory (2) rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1179) [sender=3.1.2] 3. Since fetching the checkpoint record so use the checkpoint TLI which is 12, although primary doesn’t have 0000000C00000000000000F8 file as it renamed it to 0000000C00000000000000F8.partial But there is the logic in walsender that if requested wal is there in the current TLI then send from their so it will stream from 0000000D00000000000000F8 file 2020-12-22 04:49:42 UTC [9105]: [1-1] user=; db=; app=; client=; SQLcode=00000 LOG: fetching timeline history file for timeline 12 from primary server 2020-12-22 04:49:42 UTC [9105]: [2-1] user=; db=; app=; client=; SQLcode=00000 LOG: started streaming WAL from primary at 0/F8000000 on timeline 12 2020-12-22 04:49:42 UTC [9105]: [3-1] user=; db=; app=; client=; SQLcode=00000 LOG: replication terminated by primary server 2020-12-22 04:49:42 UTC [9105]: [4-1] user=; db=; app=; client=; SQLcode=00000 DETAIL: End of WAL reached on timeline 12 at 0/F8000098. 4. Now since walreciver assumes that it has restore the WAL from the TL 12 the recieveTLI is 12 and the expectedTLEs is set base on the 0000000C.history. See below Logic in WaitForWalToBecomeAvailable if (readFile < 0) { if (!expectedTLEs) expectedTLEs = readTimeLineHistory(receiveTLI); 2020-12-22 04:49:42 UTC [9082]: [6-1] user=; db=; app=; client=; SQLcode=00000 LOG: restored log file "0000000C.history" from archive rsync: link_stat "/wal_archive/ins30wfm02dbs/0000000E.history" failed: No such file or directory (2) rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1179) [sender=3.1.2] rsync: link_stat "/wal_archive/ins30wfm02dbs/0000000C00000000000000F8" failed: No such file or directory (2) rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1179) [sender=3.1.2] 2020-12-22 04:49:42 UTC [9082]: [7-1] user=; db=; app=; client=; SQLcode=00000 LOG: restored log file "0000000C.history" from archive 2020-12-22 04:49:42 UTC [9082]: [8-1] user=; db=; app=; client=; SQLcode=00000 LOG: consistent recovery state reached at 0/F8000098 2020-12-22 04:49:42 UTC [9082]: [9-1] user=; db=; app=; client=; SQLcode=00000 LOG: invalid record length at 0/F8000098: wanted 24, got 0 2020-12-22 04:49:42 UTC [9074]: [3-1] user=; db=; app=; client=; SQLcode=00000 LOG: database system is ready to accept read only connections 2020-12-22 04:49:42 UTC [9105]: [5-1] user=; db=; app=; client=; SQLcode=00000 LOG: restarted WAL streaming at 0/F8000000 on timeline 12 2020-12-22 04:49:42 UTC [9105]: [6-1] user=; db=; app=; client=; SQLcode=00000 LOG: replication terminated by primary server 2020-12-22 04:49:42 UTC [9105]: [7-1] user=; db=; app=; client=; SQLcode=00000 DETAIL: End of WAL reached on timeline 12 at 0/F8000098. 4. Now, expectedTLEs head is as 12 and recoveryTargetTLI is 13 so in rescanLatestTimeLine we always assume we are at the latest Ali but we try to archive from expectedTLEs which is old TLI. rsync: link_stat "/wal_archive/ins30wfm02dbs/0000000E.history" failed: No such file or directory (2) rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1179) [sender=3.1.2] rsync: link_stat "/wal_archive/ins30wfm02dbs/0000000C00000000000000F8" failed: No such file or directory (2) rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1179) [sender=3.1.2] 2020-12-22 04:49:47 UTC [9105]: [8-1] user=; db=; app=; client=; SQLcode=00000 LOG: restarted WAL streaming at 0/F8000000 on timeline 12 2020-12-22 04:49:47 UTC [9105]: [9-1] user=; db=; app=; client=; SQLcode=00000 LOG: replication terminated by primary server 2020-12-22 04:49:47 UTC [9105]: [10-1] user=; db=; app=; client=; SQLcode=00000 DETAIL: End of WAL reached on timeline 12 at 0/F8000098. rsync: link_stat "/wal_archive/ins30wfm02dbs/0000000E.history" failed: No such file or directory (2) rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1179) [sender=3.1.2] rsync: link_stat "/wal_archive/ins30wfm02dbs/0000000C00000000000000F8" failed: No such file or directory (2) -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Thu, Jan 21, 2021 at 4:00 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > 8. Node3, get it because walsender of Node2 read it from TL13 and send > it and Node2 write in the new WAL file but with TL12. > > WalSndSegmentOpen() > { > /*------- > * When reading from a historic timeline, and there is a timeline switch > * within this segment, read from the WAL segment belonging to the new > * timeline. > } > > 9. Node3, now set the expectedTLEs to 12 because that is what > walreceiver has streamed the WAL using. This seems to be incorrect, because the comment for expectedTLEs says: * expectedTLEs: a list of TimeLineHistoryEntries for recoveryTargetTLI and the timelines of * its known parents, newest first (so recoveryTargetTLI is always the * first list member). Only these TLIs are expected to be seen in the WAL * segments we read, and indeed only these TLIs will be considered as * candidate WAL files to open at all. But in your scenario apparently we end up with a situation that contradicts that, because you go on to say: > 10. Node3, now recoveryTargetTLI is 13 and expectedTLEs is 12. So As I understand, expectedTLEs should end up being a list where the first element is the timeline we want to end up on, and the last element is the timeline where we are now, and every timeline in the list branches off of the next timeline in the list. So here if 13 branches of 12 then the list should be 13,12 not just 12; and if 13 does not branch off of 12 OR if 13 branches off of 12 at an earlier point in the WAL stream than where we are now, then that should be an error that shuts down the standby, because then there is no way for replay to get from where it is now to the correct timeline. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Jan 22, 2021 at 2:05 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Thu, Jan 21, 2021 at 4:00 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > 8. Node3, get it because walsender of Node2 read it from TL13 and send > > it and Node2 write in the new WAL file but with TL12. > > > > WalSndSegmentOpen() > > { > > /*------- > > * When reading from a historic timeline, and there is a timeline switch > > * within this segment, read from the WAL segment belonging to the new > > * timeline. > > } > > > > 9. Node3, now set the expectedTLEs to 12 because that is what > > walreceiver has streamed the WAL using. > > This seems to be incorrect, because the comment for expectedTLEs says: > > * expectedTLEs: a list of TimeLineHistoryEntries for > recoveryTargetTLI and the timelines of > * its known parents, newest first (so recoveryTargetTLI is always the > * first list member). Only these TLIs are expected to be seen in the WAL > * segments we read, and indeed only these TLIs will be considered as > * candidate WAL files to open at all. > > But in your scenario apparently we end up with a situation that > contradicts that, because you go on to say: > > > 10. Node3, now recoveryTargetTLI is 13 and expectedTLEs is 12. So > > As I understand, expectedTLEs should end up being a list where the > first element is the timeline we want to end up on, and the last > element is the timeline where we are now, and every timeline in the > list branches off of the next timeline in the list. So here if 13 > branches of 12 then the list should be 13,12 not just 12; and if 13 > does not branch off of 12 OR if 13 branches off of 12 at an earlier > point in the WAL stream than where we are now, then that should be an > error that shuts down the standby, because then there is no way for > replay to get from where it is now to the correct timeline. Yeah, I agree with this. So IMHO the expectedTLEs should be set based on the recoveryTargetTLI instead of receiveTLI. Based on the expectedTLEs definition it can never be correct to set it based on the receiveTLI. Basically, the simple fix could be this. diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index b18257c198..465bc7c929 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -12533,7 +12533,8 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess, if (readFile < 0) { if (!expectedTLEs) - expectedTLEs = readTimeLineHistory(receiveTLI); + expectedTLEs = readTimeLineHistory(recoveryTargetTLI); + readFile = XLogFileRead(readSegNo, PANIC, receiveTLI, XLOG_FROM_STREAM, false); But I am afraid that whether this adjustment (setting based on receiveTLI) is done based on some analysis or part of some bug fix. I will try to find the history of this and maybe based on that we can make a better decision. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Tue, Mar 2, 2021 at 3:14 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > ===== > ee994272ca50f70b53074f0febaec97e28f83c4e > Author: Heikki Linnakangas <heikki.linnakangas@iki.fi> 2013-01-03 14:11:58 > Committer: Heikki Linnakangas <heikki.linnakangas@iki.fi> 2013-01-03 14:11:58 > > Delay reading timeline history file until it's fetched from master. > > Streaming replication can fetch any missing timeline history files from the > master, but recovery would read the timeline history file for the target > timeline before reading the checkpoint record, and before walreceiver has > had a chance to fetch it from the master. Delay reading it, and the sanity > checks involving timeline history, until after reading the checkpoint > record. > > There is at least one scenario where this makes a difference: if you take > a base backup from a standby server right after a timeline switch, the > WAL segment containing the initial checkpoint record will begin with an > older timeline ID. Without the timeline history file, recovering that file > will fail as the older timeline ID is not recognized to be an ancestor of > the target timeline. If you try to recover from such a backup, using only > streaming replication to fetch the WAL, this patch is required for that to > work. > ===== The above commit avoid initializing the expectedTLEs from the recoveryTargetTLI as shown in below hunk from this commit. @@ -5279,49 +5299,6 @@ StartupXLOG(void) */ readRecoveryCommandFile(); - /* Now we can determine the list of expected TLIs */ - expectedTLEs = readTimeLineHistory(recoveryTargetTLI); - I think the fix for the problem will be that, after reading/validating the checkpoint record, we can free the current value of expectedTLEs and reinitialize it based on the recoveryTargetTLI as shown in the attached patch? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Вложения
At Tue, 4 May 2021 17:41:06 +0530, Dilip Kumar <dilipbalaut@gmail.com> wrote in > I think the fix for the problem will be that, after reading/validating > the checkpoint record, we can free the current value of expectedTLEs > and reinitialize it based on the recoveryTargetTLI as shown in the > attached patch? I'm not sure I understand the issue here. I think that the attached should reproduce the issue mentioned here, but didn't for me. The result of running the attached test script is shown below. TLIs are adjusted in your descriptions cited below. The lines prefixed by NodeN> are the server log lines written while running the attached test script. > 1. Node2 got promoted to new primary, and node 2 picked new TL 2 in > the middle of the segment 3. Node2> LOG: selected new timeline ID: 2 > 2. Node3, restarted with new primary info of Node2 and restore command Node2> node_3 LOG: received replication command: IDENTIFY_SYSTEM > 3. Node3, found the newest TL2 in validateRecoveryParameters() Because > the latest TL was requested in recovery.conf (history file restored > from TL2) and set recoveryTargetTLI to 2 So point to note is > recoveryTargetTLI is set to 2 but expectedTLEs is not yet set. This means you specified recovery_target_timeline? Either way, expectedTLEs is not relevant to the behavior here. Even if recovery_target_timeline is set to latest, findNewestTimeLine doesn't look it. Node3> LOG: restored log file "00000002.history" from archive > 4. Node3, entered into the standby mode. Node3> LOG: entering standby mode > 5. Node3, tries to read the checkpoint Record, on Node3 still the > checkpoint TL (ControlFile->checkPointCopy.ThisTimeLineID) is 1. expectedTLEs is loaded just before fetching the last checkpoint. ReadCheckpointRecord doesn't consider checkPointCopy.ThisTimeLineID. The reason for the checkpoint TLI is that the segment file was that of the newest TLI in expectedTLEs found in pg_wal directory. If the segment for TLI=2 containing the last checkpoint had been archived, checkpoint record would be read as TLI=2. Replication starts at TLI=2 in this case because archive recovery has reached that timeline. (Turn on the optional section in the attached test script to see this behavior.) This is the expected behavior since we assume that the segment files for TLI=n and n+1 are identical in the TLI=n part. Anyway the checkpoint that is read is on TLI=1 in this case and replication starts at TLI=1. Node3> LOG: Checkpoint record: TLI=1, 0/3014F78 > 6. Node3, tries to get the checkpoint record file using new TL2 from > the archive which it should get ideally but it may not if the Node2 > haven't yet archived it. This doesn't happen for me. Instead, node3 runs recovery from the checkpoint up to the end of the archived WAL. In this case the end point is 3014FF0@TLI=1. Node3> LOG: invalid record length at 0/3014FF0: wanted 24, got 0 Then, node3 connects to node2 requesting TLI=1 because the history file (or expectedTLEs) told that the LSN belongs to TLI=1. Node3> LOG: 0/3014FF0 is on TLI 1 Node3> LOG: started streaming WAL from primary at 0/3000000 on timeline 1 After a while node2 finds a timeline switch and disconnects the replication. Node3> LOG: replication terminated by primary server Node3> DETAIL: End of WAL reached on timeline 1 at 0/3029A68. After scanning the archive and pg_wal ends in failure, node3 correctly requests node2 for TLI=2 because expectedTLEs told that the current LSN belongs to TLI=2. Node3> LOG: 0/3029A68 is on TLI 2 Node3> LOG: restarted WAL streaming at 0/3000000 on timeline 2 Finally the items below don't happen for me, because node3 needs not to go back to the last checkpoint any longer. Perhaps the script is failing to reproduce your issue correctly. > 7. Node3, tries to stream from primary but using TL1 because > ControlFile->checkPointCopy.ThisTimeLineID is 1. As mentioned above, the checkPointCopy.ThisTimeLineID on either the primary and secondary is irrelevant to the timline the primary sends. The primary streams the timeline requested by the secondary. > 8. Node3, get it because walsender of Node2 read it from TL2 and send > it and Node2 write in the new WAL file but with TL1. Walsender strems the requested TLI from walreceiver, then disconnects at the end of the TLI notifying node3 of the next TLI. Node3 re-establishes replication with the new TLI. Looking into pg_wal of node3, segment 3 for both TLI=1 and 2 are filled by the correct content. So,,, I don't understand what are you saying is the race condition.. An issue that may be slightly relevant to this case have been raised [1]. But it is about writing end-of-recovery checkpoint into the older timeline. Could you please fix the test script so that it causes your issue correctly? And/or elaborate a bit more? The attached first file is the debugging aid logging. The second is the test script, to be placed in src/test/recovery/t. 1: https://www.postgresql.org/message-id/CAE-ML%2B_EjH_fzfq1F3RJ1%3DXaaNG%3D-Jz-i3JqkNhXiLAsM3z-Ew%40mail.gmail.com regards. -- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c index 8d0903c175..9483fd055c 100644 --- a/src/backend/access/transam/timeline.c +++ b/src/backend/access/transam/timeline.c @@ -55,6 +55,7 @@ restoreTimeLineHistoryFiles(TimeLineID begin, TimeLineID end) for (tli = begin; tli < end; tli++) { + elog(LOG, "Trying restoring history file for TLI=%d", tli); if (tli == 1) continue; @@ -95,6 +96,7 @@ readTimeLineHistory(TimeLineID targetTLI) if (ArchiveRecoveryRequested) { + elog(LOG, "Trying reading history file for TLI=%d", targetTLI); TLHistoryFileName(histfname, targetTLI); fromArchive = RestoreArchivedFile(path, histfname, "RECOVERYHISTORY", 0, false); @@ -231,6 +233,7 @@ existsTimeLineHistory(TimeLineID probeTLI) if (ArchiveRecoveryRequested) { + elog(LOG, "Probing history file for TLI=%d", probeTLI); TLHistoryFileName(histfname, probeTLI); RestoreArchivedFile(path, histfname, "RECOVERYHISTORY", 0, false); } diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index adfc6f67e2..e31e1f0ce3 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -3732,6 +3732,7 @@ XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli, snprintf(activitymsg, sizeof(activitymsg), "waiting for %s", xlogfname); set_ps_display(activitymsg); + elog(LOG, "Trying fetching history file for TLI=%d", tli); restoredFromArchive = RestoreArchivedFile(path, xlogfname, "RECOVERYXLOG", wal_segment_size, @@ -3825,7 +3826,10 @@ XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source) if (expectedTLEs) tles = expectedTLEs; else + { + elog(LOG, "Loading history file for TLI=%d (2)", recoveryTargetTLI); tles = readTimeLineHistory(recoveryTargetTLI); + } foreach(cell, tles) { @@ -3839,6 +3843,8 @@ XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source) * Skip scanning the timeline ID that the logfile segment to read * doesn't belong to */ + elog(LOG, "scanning segment %lX TLI %d, source %d", segno, tli, source); + if (hent->begin != InvalidXLogRecPtr) { XLogSegNo beginseg = 0; @@ -3865,6 +3871,7 @@ XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source) XLOG_FROM_ARCHIVE, true); if (fd != -1) { + elog(LOG, "found segment %lX TLI %d, from archive", segno, tli); elog(DEBUG1, "got WAL segment from archive"); if (!expectedTLEs) expectedTLEs = tles; @@ -3878,6 +3885,7 @@ XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source) XLOG_FROM_PG_WAL, true); if (fd != -1) { + elog(LOG, "found segment %lX TLI %d, from PG_WAL", segno, tli); if (!expectedTLEs) expectedTLEs = tles; return fd; @@ -8421,7 +8429,7 @@ ReadCheckpointRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, XLogBeginRead(xlogreader, RecPtr); record = ReadRecord(xlogreader, LOG, true); - + elog(LOG, "Checkpoint record: TLI=%d, %X/%X, rectargetTLI=%d, exptles=%p", xlogreader->seg.ws_tli, LSN_FORMAT_ARGS(xlogreader->ReadRecPtr),recoveryTargetTLI, expectedTLEs); if (record == NULL) { if (!report) @@ -12628,7 +12636,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess, * TLI, rather than the position we're reading. */ tli = tliOfPointInHistory(tliRecPtr, expectedTLEs); - + elog(LOG, "%X/%X is on TLI %X", LSN_FORMAT_ARGS(tliRecPtr), tli); if (curFileTLI > 0 && tli < curFileTLI) elog(ERROR, "according to history file, WAL location %X/%X belongs to timeline %u, but previousrecovered WAL file came from timeline %u", LSN_FORMAT_ARGS(tliRecPtr), @@ -12697,7 +12705,10 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess, if (readFile < 0) { if (!expectedTLEs) + { + elog(LOG, "Loading expectedTLEs for %d", receiveTLI); expectedTLEs = readTimeLineHistory(receiveTLI); + } readFile = XLogFileRead(readSegNo, PANIC, receiveTLI, XLOG_FROM_STREAM, false); # Minimal test testing streaming replication use strict; use warnings; use PostgresNode; use TestLib; use Test::More tests => 2; my $primary = get_new_node('primary'); $primary->init(allows_streaming => 1); $primary->start; my $backup_name = 'my_backup'; $primary->backup($backup_name); my $node_2 = get_new_node('node_2'); $node_2->init_from_backup($primary, $backup_name, has_streaming => 1, allows_streaming => 1); $node_2->append_conf('postgresql.conf', "archive_mode = always"); my $archdir = $node_2->data_dir . "/../archive"; $node_2->append_conf('postgresql.conf', "archive_command = 'cp %p $archdir/%f'"); mkdir($archdir); $node_2->start; # Create streaming standby linking to primary my $node_3 = get_new_node('node_3'); $node_3->init_from_backup($primary, $backup_name, has_streaming => 1); $node_3->append_conf('postgresql.conf', "restore_command = 'cp $archdir/%f %p'"); $node_3->start; $primary->psql('postgres', 'SELECT pg_switch_wal(); CREATE TABLE t(); DROP TABLE t; CHECKPOINT;'); $primary->wait_for_catchup($node_2, 'write', $primary->lsn('insert')); $primary->wait_for_catchup($node_3, 'write', $primary->lsn('insert')); $node_3->stop; # put node3 a bit behind to cause streaming on the old timeline $primary->psql('postgres', 'CREATE TABLE t(); DROP TABLE t;'); $primary->wait_for_catchup($node_2, 'write', $primary->lsn('insert')); $primary->stop; # promote node2 $node_2->psql('postgres', 'SELECT pg_promote()'); # optional: archive segment 3 of TLI=2 on node2 and advance one more segment if (0) { my $lastwal = $node_2->safe_psql('postgres', 'select last_archived_wal from pg_stat_archiver'); $node_2->psql('postgres', 'SELECT pg_switch_wal();'); $node_2->psql('postgres', 'CREATE TABLE t(); DROP TABLE t;'); $node_2->poll_query_until('postgres', "SELECT '$lastwal' <> last_archived_wal from pg_stat_archiver"); $lastwal = $node_2->safe_psql('postgres', 'select last_archived_wal from pg_stat_archiver'); $node_2->psql('postgres', 'SELECT pg_switch_wal();'); $node_2->psql('postgres', 'CREATE TABLE t(); DROP TABLE t;'); $node_2->poll_query_until('postgres', "SELECT '$lastwal' <> last_archived_wal from pg_stat_archiver"); } # attach node3 as a standby of node2 $node_3->enable_streaming($node_2); $node_3->append_conf('postgresql.conf', "recovery_target_timeline = 2"); # *restart# node3, not just reloading to trigger archive recovery $node_3->start; $node_2->psql('postgres', 'CREATE TABLE t(); DROP TABLE t;'); $node_2->psql('postgres', 'SELECT pg_switch_wal();'); # XXX: another defect comes out without this X( $node_2->psql('postgres', 'CREATE TABLE t(); DROP TABLE t;'); $node_2->wait_for_catchup($node_3, 'write', $node_2->lsn('insert')); my $result = $node_2->safe_psql('postgres', 'SELECT pg_current_wal_insert_lsn() = write_lsn FROM pg_stat_replication'); ok($result eq 't', 'check'); # set 0 to leave data directories after a successful run ok(1, 'break');
On Fri, May 7, 2021 at 8:23 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > At Tue, 4 May 2021 17:41:06 +0530, Dilip Kumar <dilipbalaut@gmail.com> wrote in > Could you please fix the test script so that it causes your issue > correctly? And/or elaborate a bit more? > > The attached first file is the debugging aid logging. The second is > the test script, to be placed in src/test/recovery/t. I will look into your test case and try to see whether we can reproduce the issue. But let me summarise what is the exact issue. Basically, the issue is that first in validateRecoveryParameters if the recovery target is the latest then we fetch the latest history file and set the recoveryTargetTLI timeline to the latest available timeline assume it's 2 but we delay updating the expectedTLEs (as per commit ee994272ca50f70b53074f0febaec97e28f83c4e). Now, while reading the checkpoint record if we don't get the required WAL from the archive then we try to get from primary, and while getting checkpoint from primary we use "ControlFile->checkPointCopy.ThisTimeLineID" suppose that is older timeline 1. Now after reading the checkpoint we will set the expectedTLEs based on the timeline from which we got the checkpoint record. See below Logic in WaitForWalToBecomeAvailable if (readFile < 0) { if (!expectedTLEs) expectedTLEs = readTimeLineHistory(receiveTLI); Now, the first problem is we are breaking the sanity of expectedTLEs because as per the definition it should already start with recoveryTargetTLI but it is starting with the older TLI. Now, in rescanLatestTimeLine we are trying to fetch the latest TLI which is still 2, so this logic returns without reinitializing the expectedTLEs because it assumes that if recoveryTargetTLI is pointing to 2 then expectedTLEs must be correct and need not be changed. See below logic: rescanLatestTimeLine(void) { .... newtarget = findNewestTimeLine(recoveryTargetTLI); if (newtarget == recoveryTargetTLI) { /* No new timelines found */ return false; } ... newExpectedTLEs = readTimeLineHistory(newtarget); ... expectedTLEs = newExpectedTLEs; Solution: 1. Find better way to fix the problem of commit (ee994272ca50f70b53074f0febaec97e28f83c4e) which is breaking the sanity of expectedTLEs. 2. Assume, we have to live with fix 1 and we have to initialize expectedTLEs with an older timeline for validating the checkpoint in absence of tl.hostory file (as this commit claims). Then as soon as we read and validate the checkpoint, fix the expectedTLEs and set it based on the history file of recoveryTargetTLI. Does this explanation make sense? If not please let me know what part is not clear in the explanation so I can point to that code. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Thanks. At Fri, 7 May 2021 11:04:53 +0530, Dilip Kumar <dilipbalaut@gmail.com> wrote in > On Fri, May 7, 2021 at 8:23 AM Kyotaro Horiguchi > <horikyota.ntt@gmail.com> wrote: > > > > At Tue, 4 May 2021 17:41:06 +0530, Dilip Kumar <dilipbalaut@gmail.com> wrote in > > Could you please fix the test script so that it causes your issue > > correctly? And/or elaborate a bit more? > > > > The attached first file is the debugging aid logging. The second is > > the test script, to be placed in src/test/recovery/t. > > I will look into your test case and try to see whether we can > reproduce the issue. But let me summarise what is the exact issue. > Basically, the issue is that first in validateRecoveryParameters if > the recovery target is the latest then we fetch the latest history > file and set the recoveryTargetTLI timeline to the latest available > timeline assume it's 2 but we delay updating the expectedTLEs (as per > commit ee994272ca50f70b53074f0febaec97e28f83c4e). Now, while reading I think it is right up to here. > the checkpoint record if we don't get the required WAL from the > archive then we try to get from primary, and while getting checkpoint > from primary we use "ControlFile->checkPointCopy.ThisTimeLineID" > suppose that is older timeline 1. Now after reading the checkpoint we > will set the expectedTLEs based on the timeline from which we got the > checkpoint record. I doubt this point. ReadCheckpointRecord finally calls XLogFileReadAnyTLI and it uses the content of the 00000002.history as the local timeline entry list, since expectedTLEs is NIL and recoveryTargetTLI has been updated to 2 by validateRecoveryParameters(). But node 3 was having only the segment on TLI=1 so ReadCheckPointRecord() finds the wanted checkpoint recrod on TLI=1. XLogFileReadAnyTLI() copies the local TLE list based on TLI=2 to expectedTLEs just after that because the wanted checkpoint record was available based on the list. So I don't think checkPointCopy.ThisTimeLineID cannot affect this logic, and don't think expectedTLEs is left with NIL. It's helpful that you could show the specific code path to cause that. > See below Logic in WaitForWalToBecomeAvailable > if (readFile < 0) > { > if (!expectedTLEs) > expectedTLEs = readTimeLineHistory(receiveTLI); > > Now, the first problem is we are breaking the sanity of expectedTLEs > because as per the definition it should already start with > recoveryTargetTLI but it is starting with the older TLI. Now, in If my description above is correct, expectedTLEs has been always filled by TLI=2's hisotry so readTimeLineHistory is not called there. After that the things are working as described in my previous mail. So The following is not an issue if I'm not missing something. > rescanLatestTimeLine we are trying to fetch the latest TLI which is > still 2, so this logic returns without reinitializing the expectedTLEs > because it assumes that if recoveryTargetTLI is pointing to 2 then > expectedTLEs must be correct and need not be changed. > > See below logic: > rescanLatestTimeLine(void) > { > .... > newtarget = findNewestTimeLine(recoveryTargetTLI); > if (newtarget == recoveryTargetTLI) > { > /* No new timelines found */ > return false; > } > ... > newExpectedTLEs = readTimeLineHistory(newtarget); > ... > expectedTLEs = newExpectedTLEs; > > > Solution: > 1. Find better way to fix the problem of commit > (ee994272ca50f70b53074f0febaec97e28f83c4e) which is breaking the > sanity of expectedTLEs. > 2. Assume, we have to live with fix 1 and we have to initialize > expectedTLEs with an older timeline for validating the checkpoint in > absence of tl.hostory file (as this commit claims). Then as soon as > we read and validate the checkpoint, fix the expectedTLEs and set it > based on the history file of recoveryTargetTLI. > > Does this explanation make sense? If not please let me know what part > is not clear in the explanation so I can point to that code. So, unfortunately not. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Fri, May 7, 2021 at 2:33 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > Thanks. > > > At Fri, 7 May 2021 11:04:53 +0530, Dilip Kumar <dilipbalaut@gmail.com> wrote in > > On Fri, May 7, 2021 at 8:23 AM Kyotaro Horiguchi > > <horikyota.ntt@gmail.com> wrote: > > > > > > At Tue, 4 May 2021 17:41:06 +0530, Dilip Kumar <dilipbalaut@gmail.com> wrote in > > > Could you please fix the test script so that it causes your issue > > > correctly? And/or elaborate a bit more? > > > > > > The attached first file is the debugging aid logging. The second is > > > the test script, to be placed in src/test/recovery/t. > > > > I will look into your test case and try to see whether we can > > reproduce the issue. But let me summarise what is the exact issue. > > Basically, the issue is that first in validateRecoveryParameters if > > the recovery target is the latest then we fetch the latest history > > file and set the recoveryTargetTLI timeline to the latest available > > timeline assume it's 2 but we delay updating the expectedTLEs (as per > > commit ee994272ca50f70b53074f0febaec97e28f83c4e). Now, while reading > > I think it is right up to here. > > > the checkpoint record if we don't get the required WAL from the > > archive then we try to get from primary, and while getting checkpoint > > from primary we use "ControlFile->checkPointCopy.ThisTimeLineID" > > suppose that is older timeline 1. Now after reading the checkpoint we > > will set the expectedTLEs based on the timeline from which we got the > > checkpoint record. > > I doubt this point. ReadCheckpointRecord finally calls > XLogFileReadAnyTLI and it uses the content of the 00000002.history as > the local timeline entry list, since expectedTLEs is NIL and > recoveryTargetTLI has been updated to 2 by > validateRecoveryParameters(). But node 3 was having only the segment > on TLI=1 so ReadCheckPointRecord() finds the wanted checkpoint recrod > on TLI=1. XLogFileReadAnyTLI() copies the local TLE list based on > TLI=2 to expectedTLEs just after that because the wanted checkpoint > record was available based on the list. Okay, I got your point, now, consider the scenario that we are trying to get the checkpoint record in XLogFileReadAnyTLI, you are right that it returns history file 00000002.history. I think I did not mention one point, basically, the tool while restarting node 3 after promoting node 2 is deleting all the local WAL of node3 (so that node 3 can follow node2). So now node3 doesn't have the checkpoint in the local segment. Suppose checkpoint record was in segment 000000010000000000000001, but after TL switch 000000010000000000000001 is renamed to 000000010000000000000001.partial on node2 so now practically doesn't have 000000010000000000000001 file anywhere. However if TL switch mid-segment then we copy that segment with new TL so we have 000000020000000000000001 which contains the checkpoint record, but node 2 haven't yet archived it. So now you come out of XLogFileReadAnyTLI, without reading checkpoint record and without setting expectedTLEs. Because expectedTLEs is only set if we are able to read the checkpoint record. Make sense? > So I don't think checkPointCopy.ThisTimeLineID cannot affect this > logic, and don't think expectedTLEs is left with NIL. It's helpful > that you could show the specific code path to cause that. So now expectedTLEs is still NULL and you go to get the checkpoint record from primary and use checkPointCopy.ThisTimeLineID. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
At Fri, 7 May 2021 15:16:03 +0530, Dilip Kumar <dilipbalaut@gmail.com> wrote in > Okay, I got your point, now, consider the scenario that we are trying > to get the checkpoint record in XLogFileReadAnyTLI, you are right that > it returns history file 00000002.history. I think I did not mention > one point, basically, the tool while restarting node 3 after promoting > node 2 is deleting all the local WAL of node3 (so that node 3 can > follow node2). So now node3 doesn't have the checkpoint in the local > segment. Suppose checkpoint record was in segment ... > So now you come out of XLogFileReadAnyTLI, without reading checkpoint > record and without setting expectedTLEs. Because expectedTLEs is only > set if we are able to read the checkpoint record. Make sense? Thanks. I understood the case and reproduced. Although I don't think removing WAL files from non-backup cluster is legit, I also think we can safely start archive recovery from a replicated segment. > So now expectedTLEs is still NULL and you go to get the checkpoint > record from primary and use checkPointCopy.ThisTimeLineID. I don't think erasing expectedTLEs after once set is the right fix because expectedTLEs are supposed to be set just once iff we are sure that we are going to follow the history, until rescan changes it as the only exception. It seems to me the issue here is not a race condition but WaitForWALToBecomeAvailable initializing expectedTLEs with the history of a improper timeline. So using recoveryTargetTLI instead of receiveTLI for the case fixes this issue. - if (!expectedTLEs) - expectedTLEs = readTimeLineHistory(receiveTLI); I thought that the reason using receiveTLI instead of recoveryTargetTLI here is that there's a case where receiveTLI is the future of recoveryTarrgetTLI but I haven't successfully had such a situation. If I set recovoryTargetTLI to a TLI that standby doesn't know but primary knows, validateRecoveryParameters immediately complains about that before reaching there. Anyway the attached assumes receiveTLI may be the future of recoveryTargetTLI. Just inserting if() into the exising code makes the added lines stick out of the right side edge of 80 columns so I refactored there a bit to lower indentation. I believe the 004_timeline_switch.pl detects your issue. And the attached change fixes it. Any suggestions are welcome. regards. -- Kyotaro Horiguchi NTT Open Source Software Center From 52eba6b6ef8d8fda078d3acd27b6ce7406078df8 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Date: Mon, 10 May 2021 16:57:24 +0900 Subject: [PATCH] Choose correct timeline when received historic timelines When a standby starts streaming before determining expectedTLEs, it determines the expected timeline list based on the timeline of the WAL segment just streamed from the primary. If we fetched the last checkpoint in the older timeline, this behavior prevents recovery from proceeding. When primary streamed over a WAL file in a historical timeline, use recoveryTargetTLI, which must be a history of the primary. There's one scenario where this makes a difference: suppose two standbys are connected to a primary both by archive and streaming. In the case where one of the standby promotes, then another reconnects to the promoted standby before archiving the first segment of the new timeline, and the content of pg_wal of the new standby is removed before reconnection, the standby fetches the history file for the new timeline but the first segment for the timeline is available only via streaming. In this case, the new standby thought that the primary always sends newer timeline than the current recovery target but that is not the case of this scenario. --- src/backend/access/transam/xlog.c | 45 ++++++++++--- src/test/recovery/t/004_timeline_switch.pl | 74 +++++++++++++++++++++- 2 files changed, 108 insertions(+), 11 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index c1d4415a43..1ca55b7ec0 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -12656,22 +12656,47 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess, * info is set correctly and XLogReceiptTime isn't * changed. */ - if (readFile < 0) - { - if (!expectedTLEs) - expectedTLEs = readTimeLineHistory(receiveTLI); - readFile = XLogFileRead(readSegNo, PANIC, - receiveTLI, - XLOG_FROM_STREAM, false); - Assert(readFile >= 0); - } - else + + if (readFile >= 0) { /* just make sure source info is correct... */ readSource = XLOG_FROM_STREAM; XLogReceiptSource = XLOG_FROM_STREAM; return true; } + + readFile = XLogFileRead(readSegNo, PANIC, + receiveTLI, + XLOG_FROM_STREAM, false); + Assert(readFile >= 0); + + if (expectedTLEs) + break; + + /* initialize expectedTLEs */ + expectedTLEs = readTimeLineHistory(receiveTLI); + + if (!tliInHistory(recoveryTargetTLI, expectedTLEs)) + { + /* + * Received timeline is not the future of the + * recovery target timeline. The upstream must have + * instead sent a historic timeline of ours. This + * can happen when this standby's pg_wal has been + * cleaned up before restart but archive offered + * the history file recovery target. Pick up the + * history of the recovery target timeline. + */ + expectedTLEs = + readTimeLineHistory(recoveryTargetTLI); + if (!tliInHistory(receiveTLI, expectedTLEs)) + ereport(FATAL, + (errcode(ERRCODE_DATA_CORRUPTED), + errmsg ("the primary server has sent incompatible timeline %d with recovery targettimeline %d", + receiveTLI, + recoveryTargetTLI))); + } + break; } diff --git a/src/test/recovery/t/004_timeline_switch.pl b/src/test/recovery/t/004_timeline_switch.pl index c101980e9e..bca7ad165e 100644 --- a/src/test/recovery/t/004_timeline_switch.pl +++ b/src/test/recovery/t/004_timeline_switch.pl @@ -7,7 +7,7 @@ use warnings; use File::Path qw(rmtree); use PostgresNode; use TestLib; -use Test::More tests => 3; +use Test::More tests => 4; $ENV{PGDATABASE} = 'postgres'; @@ -109,3 +109,75 @@ $node_primary_2->wait_for_catchup($node_standby_3, 'replay', my $result_2 = $node_standby_3->safe_psql('postgres', "SELECT count(*) FROM tab_int"); is($result_2, qq(1), 'check content of standby 3'); + +##### +# Check if standby can follow primary when the segment for the new +# time line is not avaiable for the standby in archive. + +# setup two standby nodes, one of them archives WAL files +$node_primary_2->psql('postgres', 'CREATE TABLE t (a int)'); +my $node_standby_4 = get_new_node('standby_3_1'); +$node_standby_4->init_from_backup($node_primary_2, $backup_name, + has_streaming => 1, allows_streaming => 1, has_archiving => 1); +my $archdir = $node_standby_4->archive_dir; + +# keep 3 segments +my $keepsize = $node_primary_2->safe_psql('postgres', + q[ +select setting from pg_settings where name = 'wal_segment_size'; + ]) * 3; +$node_standby_4->append_conf('postgresql.conf', qq[ +wal_keep_size = $keepsize +archive_mode = always +archive_command = 'cp %p $archdir/%f' +]); + +$node_standby_4->start; + +my $node_standby_5 = get_new_node('node_standby_5'); +$node_standby_5->init_from_backup($node_primary_2, $backup_name, + has_streaming => 1); +$node_standby_5->start; + +$node_primary_2->psql('postgres', qq[ + SELECT pg_switch_wal(); + INSERT INTO t VALUES (0); + CHECKPOINT; +]); +$node_primary_2->wait_for_catchup($node_standby_4, 'write', + $node_primary_2->lsn('insert')); +$node_primary_2->wait_for_catchup($node_standby_5, 'write', + $node_primary_2->lsn('insert')); + +# disconnect the second standby then connect to the promoted first standby +$node_standby_5->stop; +$node_standby_4->psql('postgres', 'SELECT pg_promote()'); +$node_standby_5->enable_streaming($node_standby_4); +$node_standby_5->append_conf('postgresql.conf', + "restore_command = 'cp $archdir/%f %p'"); + +# flush pg_wal on the second standby (new standby) +my $pgwaldir = $node_standby_5->data_dir. "/pg_wal"; +opendir my $dh, $pgwaldir or die "failed to open $pgwaldir of node_standby_5"; +while (my $f = readdir($dh)) +{ + unlink("$pgwaldir/$f") if (-f "$pgwaldir/$f"); +} +closedir($dh); + +# restart the second standby, not just reloading to trigger archive recovery +$node_standby_5->start; +$node_standby_4->psql('postgres', qq[ + INSERT INTO t VALUES(0); + SELECT pg_switch_wal(); + INSERT INTO t VALUES(0); -- avoid segment border +]); + +$node_standby_4->wait_for_catchup($node_standby_5, 'write', + $node_standby_4->lsn('insert')); + +# check if the change is correctly replicated +my $result_3 = + $node_standby_4->safe_psql('postgres', + 'SELECT pg_current_wal_insert_lsn() = write_lsn FROM pg_stat_replication'); +is($result_3, 't', 'check receiving historic timeline from primary'); -- 2.27.0
On Mon, May 10, 2021 at 2:05 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > I thought that the reason using receiveTLI instead of > recoveryTargetTLI here is that there's a case where receiveTLI is the > future of recoveryTarrgetTLI but I haven't successfully had such a > situation. If I set recovoryTargetTLI to a TLI that standby doesn't > know but primary knows, validateRecoveryParameters immediately > complains about that before reaching there. Anyway the attached > assumes receiveTLI may be the future of recoveryTargetTLI. If you see the note in this commit. It says without the timeline history file, so does it trying to say that although receiveTLI is the ancestor of recovoryTargetTLI, it can not detect that because of the absence of the TL.history file ? ee994272ca50f70b53074f0febaec97e28f83c4e Author: Heikki Linnakangas <heikki.linnakangas@iki.fi> 2013-01-03 14:11:58 Committer: Heikki Linnakangas <heikki.linnakangas@iki.fi> 2013-01-03 14:11:58 ..... Without the timeline history file, recovering that file will fail as the older timeline ID is not recognized to be an ancestor of the target timeline. If you try to recover from such a backup, using only streaming replication to fetch the WAL, this patch is required for that to work. ===== > > I believe the 004_timeline_switch.pl detects your issue. And the > attached change fixes it. I think this fix looks better to me, but I will think more about it and give my feedback. Thanks for quickly coming up with the reproducible test case. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
At Mon, 10 May 2021 14:27:21 +0530, Dilip Kumar <dilipbalaut@gmail.com> wrote in > On Mon, May 10, 2021 at 2:05 PM Kyotaro Horiguchi > <horikyota.ntt@gmail.com> wrote: > > > I thought that the reason using receiveTLI instead of > > recoveryTargetTLI here is that there's a case where receiveTLI is the > > future of recoveryTarrgetTLI but I haven't successfully had such a > > situation. If I set recovoryTargetTLI to a TLI that standby doesn't > > know but primary knows, validateRecoveryParameters immediately > > complains about that before reaching there. Anyway the attached > > assumes receiveTLI may be the future of recoveryTargetTLI. > > If you see the note in this commit. It says without the timeline > history file, so does it trying to say that although receiveTLI is the > ancestor of recovoryTargetTLI, it can not detect that because of the > absence of the TL.history file ? Yeah, it reads so for me and it works as described. What I don't understand is that why the patch uses receiveTLI, not recovoryTargetTLI to load timeline hisotry in WaitForWALToBecomeAvailable. The only possible reason is that there could be a case where receivedTLI is the future of recoveryTargetTLI. However, AFAICS it's impossible for that case to happen. At replication start, requsting TLI is that of the last checkpoint, which is the same to recoveryTargetTLI, or anywhere in exising expectedTLEs which must be the past of recoveryTargetTLI. That seems to be already true at the time replication was made possible to follow a timeline switch (abfd192b1b). So I was tempted to just load history for recoveryTargetTLI then confirm that receiveTLI is in the history. Actually that change doesn't harm any of the recovery TAP tests. It is way simpler than the last patch. However, I'm not confident that it is right.. ;( > ee994272ca50f70b53074f0febaec97e28f83c4e > Author: Heikki Linnakangas <heikki.linnakangas@iki.fi> 2013-01-03 14:11:58 > Committer: Heikki Linnakangas <heikki.linnakangas@iki.fi> 2013-01-03 14:11:58 > ..... > Without the timeline history file, recovering that file > will fail as the older timeline ID is not recognized to be an ancestor of > the target timeline. If you try to recover from such a backup, using only > streaming replication to fetch the WAL, this patch is required for that to > work. > ===== > > > > > I believe the 004_timeline_switch.pl detects your issue. And the > > attached change fixes it. > > I think this fix looks better to me, but I will think more about it > and give my feedback. Thanks for quickly coming up with the > reproducible test case. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Tue, May 11, 2021 at 1:42 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > At Mon, 10 May 2021 14:27:21 +0530, Dilip Kumar <dilipbalaut@gmail.com> wrote in > > On Mon, May 10, 2021 at 2:05 PM Kyotaro Horiguchi > > <horikyota.ntt@gmail.com> wrote: > > > > > I thought that the reason using receiveTLI instead of > > > recoveryTargetTLI here is that there's a case where receiveTLI is the > > > future of recoveryTarrgetTLI but I haven't successfully had such a > > > situation. If I set recovoryTargetTLI to a TLI that standby doesn't > > > know but primary knows, validateRecoveryParameters immediately > > > complains about that before reaching there. Anyway the attached > > > assumes receiveTLI may be the future of recoveryTargetTLI. > > > > If you see the note in this commit. It says without the timeline > > history file, so does it trying to say that although receiveTLI is the > > ancestor of recovoryTargetTLI, it can not detect that because of the > > absence of the TL.history file ? > > Yeah, it reads so for me and it works as described. What I don't > understand is that why the patch uses receiveTLI, not > recovoryTargetTLI to load timeline hisotry in > WaitForWALToBecomeAvailable. The only possible reason is that there > could be a case where receivedTLI is the future of recoveryTargetTLI. > However, AFAICS it's impossible for that case to happen. At > replication start, requsting TLI is that of the last checkpoint, which > is the same to recoveryTargetTLI, or anywhere in exising expectedTLEs > which must be the past of recoveryTargetTLI. That seems to be already > true at the time replication was made possible to follow a timeline > switch (abfd192b1b). > > So I was tempted to just load history for recoveryTargetTLI then > confirm that receiveTLI is in the history. Actually that change > doesn't harm any of the recovery TAP tests. It is way simpler than > the last patch. However, I'm not confident that it is right.. ;( I first thought of fixing like as you describe that instead of loading history of receiveTLI, load history for recoveryTargetTLI. But then, this commit (ee994272ca50f70b53074f0febaec97e28f83c4e) has especially used the history file of receiveTLI to solve a particular issue which I did not clearly understand. I am not sure that whether it is a good idea to directly using recoveryTargetTLI, without exactly understanding why this commit was using receiveTLI. It doesn't seem like an oversight to me, it seems intentional. Maybe Heikki can comment on this? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Mon, May 10, 2021 at 4:35 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > It seems to me the issue here is not a race condition but > WaitForWALToBecomeAvailable initializing expectedTLEs with the history > of a improper timeline. So using recoveryTargetTLI instead of > receiveTLI for the case fixes this issue. I agree. > I believe the 004_timeline_switch.pl detects your issue. And the > attached change fixes it. So why does this use recoveryTargetTLI instead of receiveTLI only conditionally? Why not do it all the time? The hard thing about this code is that the assumptions are not very clear. If we don't know why something is a certain way, then we might break things if we change it. Worse yet, if nobody else knows why it's like that either, then who knows what assumptions they might be making? It's hard to be sure that any change is safe. But that being said, we have a clear definition from the comments for what expectedTLEs is supposed to contain, and it's only going to end up with those contents if we initialize it from recoveryTargetTLI. So I am inclined to think that we ought to do that always, and if it breaks something, then that's a sign that some other part of the code also needs fixing, because apparently that hypothetical other part of the code doesn't work if expctedTLEs contains what the comments say that it should. Now maybe that's the wrong idea. But if so, then we're saying that the definition of expectedTLEs needs to be changed, and we should update the comments with the new definition, whatever it is. A lot of the confusion here results from the fact that the code and comments are inconsistent and we can't tell whether that's intentional or inadvertent. Let's not leave the next person who looks at this code wondering the same thing about whatever changes we make. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, May 14, 2021 at 2:37 AM Robert Haas <robertmhaas@gmail.com> wrote: > > So why does this use recoveryTargetTLI instead of receiveTLI only > conditionally? Why not do it all the time? > > The hard thing about this code is that the assumptions are not very > clear. If we don't know why something is a certain way, then we might > break things if we change it. Worse yet, if nobody else knows why it's > like that either, then who knows what assumptions they might be > making? It's hard to be sure that any change is safe. > > But that being said, we have a clear definition from the comments for > what expectedTLEs is supposed to contain, and it's only going to end > up with those contents if we initialize it from recoveryTargetTLI. So > I am inclined to think that we ought to do that always, and if it > breaks something, then that's a sign that some other part of the code > also needs fixing, because apparently that hypothetical other part of > the code doesn't work if expctedTLEs contains what the comments say > that it should. > > Now maybe that's the wrong idea. But if so, then we're saying that the > definition of expectedTLEs needs to be changed, and we should update > the comments with the new definition, whatever it is. A lot of the > confusion here results from the fact that the code and comments are > inconsistent and we can't tell whether that's intentional or > inadvertent. Let's not leave the next person who looks at this code > wondering the same thing about whatever changes we make. I am not sure that have you noticed the commit id which changed the definition of expectedTLEs, Heikki has committed that change so adding him in the list to know his opinion. ===== ee994272ca50f70b53074f0febaec97e28f83c4e Author: Heikki Linnakangas <heikki.linnakangas@iki.fi> 2013-01-03 14:11:58 Committer: Heikki Linnakangas <heikki.linnakangas@iki.fi> 2013-01-03 14:11:58 Delay reading timeline history file until it's fetched from master. ..... Without the timeline history file, recovering that file will fail as the older timeline ID is not recognized to be an ancestor of the target timeline. If you try to recover from such a backup, using only streaming replication to fetch the WAL, this patch is required for that to work. ===== Part of this commit message says that it will not identify the recoveryTargetTLI as the ancestor of the checkpoint timeline (without history file). I did not understand what it is trying to say. Does it is trying to say that even though the recoveryTargetTLI is the ancestor of the checkpoint timeline but we can not track that because we don't have a history file? So to handle this problem change the definition of expectedTLEs to directly point to the checkpoint timeline? Because before this commit, we were directly initializing expectedTLEs with the history file of recoveryTargetTLI, we were not even waiting for reading the checkpoint, but under this commit, it is changed. I am referring to the below code which was deleted by this commit: ======== @@ -5279,49 +5299,6 @@ StartupXLOG(void) */ readRecoveryCommandFile(); - /* Now we can determine the list of expected TLIs */ - expectedTLEs = readTimeLineHistory(recoveryTargetTLI); - - /* - * If the location of the checkpoint record is not on the expected - * timeline in the history of the requested timeline, we cannot proceed: - * the backup is not part of the history of the requested timeline. - */ - if (tliOfPointInHistory(ControlFile->checkPoint, expectedTLEs) != - ControlFile->checkPointCopy.ThisTimeLineID) - { ========= -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
At Thu, 13 May 2021 17:07:31 -0400, Robert Haas <robertmhaas@gmail.com> wrote in > On Mon, May 10, 2021 at 4:35 AM Kyotaro Horiguchi > <horikyota.ntt@gmail.com> wrote: > > It seems to me the issue here is not a race condition but > > WaitForWALToBecomeAvailable initializing expectedTLEs with the history > > of a improper timeline. So using recoveryTargetTLI instead of > > receiveTLI for the case fixes this issue. > > I agree. > > > I believe the 004_timeline_switch.pl detects your issue. And the > > attached change fixes it. > > So why does this use recoveryTargetTLI instead of receiveTLI only > conditionally? Why not do it all the time? The commit ee994272ca apparently says that there's a case where primary > The hard thing about this code is that the assumptions are not very > clear. If we don't know why something is a certain way, then we might > break things if we change it. Worse yet, if nobody else knows why it's > like that either, then who knows what assumptions they might be > making? It's hard to be sure that any change is safe. Thanks for the comment. > But that being said, we have a clear definition from the comments for > what expectedTLEs is supposed to contain, and it's only going to end > up with those contents if we initialize it from recoveryTargetTLI. So > I am inclined to think that we ought to do that always, and if it Yes, I also found it after that, and agreed. Desynchronization between recoveryTargetTLI and expectedTLEs prevents rescanLatestTimeline from working. > breaks something, then that's a sign that some other part of the code > also needs fixing, because apparently that hypothetical other part of > the code doesn't work if expctedTLEs contains what the comments say > that it should. After some more inspection, I'm now also sure that it is a typo/thinko. Other than while fetching the first checkpoint, receivedTLI is always in the history of recoveryTargetTLI, otherwise recoveryTargetTLI is equal to receiveTLI. I read that the commit message as "waiting for fetching possible future history files to know if there's the future for the current timeline. However now I read it as "don't bother expecting for possiblly-unavailable history files when we're reading the first checkpoint the timeline for which is already known to us.". If it is correct we don't bother considering future history files coming from primary there. > Now maybe that's the wrong idea. But if so, then we're saying that the > definition of expectedTLEs needs to be changed, and we should update > the comments with the new definition, whatever it is. A lot of the > confusion here results from the fact that the code and comments are > inconsistent and we can't tell whether that's intentional or > inadvertent. Let's not leave the next person who looks at this code > wondering the same thing about whatever changes we make. Ok. The reason why we haven't have a complain about that would be that it is rare that pg_wal is wiped out before a standby connects to a just-promoted primary. I'm not sure about the tool Dilip is using, though.. So the result is the attached. This would be back-patcheable to 9.3 (or 9.6?) but I doubt that we should do as we don't seem to have had a complaint on this issue and we're not full faith on this. regards. -- Kyotaro Horiguchi NTT Open Source Software Center From 94e695031e7aa78670b1d0fd63f6cfed0d501611 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Date: Fri, 14 May 2021 13:51:01 +0900 Subject: [PATCH v2] Set expectedTLEs correctly based on recoveryTargetTLI When a standby starts streaming before it determines expectedTLEs, it is set based on the timeline of the WAL segment just streamed from the primary. If we fetch the last checkpoint in the older timeline, this behavior leads to setting expectedTLEs based on the older timeline than recoveryTargetTLI and later calls to rescanLatestTimeLine don't any longer update recoveryTargetTLI and expectedTLEs and the standby stays frozen at the point. This behavior has been introduced by commit ee994272ca but there's no explanation about breaking the defined relationship between the two variables, and there seems not to be a case the behavior is useful. Make things consistent by fixing WaitForWALToBecomeAvailable to set expectedTLEs not using receiveTLI but recoveryTargetTLI. --- src/backend/access/transam/xlog.c | 3 +- src/test/recovery/t/004_timeline_switch.pl | 72 +++++++++++++++++++++- 2 files changed, 73 insertions(+), 2 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 8d163f190f..ef8b6d2c8d 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -12662,7 +12662,8 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess, if (readFile < 0) { if (!expectedTLEs) - expectedTLEs = readTimeLineHistory(receiveTLI); + expectedTLEs = + readTimeLineHistory(recoveryTargetTLI); readFile = XLogFileRead(readSegNo, PANIC, receiveTLI, XLOG_FROM_STREAM, false); diff --git a/src/test/recovery/t/004_timeline_switch.pl b/src/test/recovery/t/004_timeline_switch.pl index c101980e9e..2c8b73f5f0 100644 --- a/src/test/recovery/t/004_timeline_switch.pl +++ b/src/test/recovery/t/004_timeline_switch.pl @@ -7,7 +7,7 @@ use warnings; use File::Path qw(rmtree); use PostgresNode; use TestLib; -use Test::More tests => 3; +use Test::More tests => 4; $ENV{PGDATABASE} = 'postgres'; @@ -109,3 +109,73 @@ $node_primary_2->wait_for_catchup($node_standby_3, 'replay', my $result_2 = $node_standby_3->safe_psql('postgres', "SELECT count(*) FROM tab_int"); is($result_2, qq(1), 'check content of standby 3'); + +###### +# Check if recoveryTargetTLI is correctly updated when we start +# replication from the checkpoint in the previous timeline of +# just-promoted primary, with having pg_wal wiped out. + +# setup two standby nodes, one of them archives WAL files +$node_primary_2->psql('postgres', 'CREATE TABLE t (a int)'); +my $node_standby_4 = get_new_node('standby_4'); +$node_standby_4->init_from_backup($node_primary_2, $backup_name, + has_streaming => 1, allows_streaming => 1, has_archiving => 1); + +$node_standby_4->enable_archiving; +my $archdir = $node_standby_4->archive_dir; + +# keep segments, and enable archive on the standby +$node_standby_4->append_conf('postgresql.conf', qq[ +wal_keep_size = 512MB +archive_mode = always +]); + +$node_standby_4->start; + +my $node_standby_5 = get_new_node('node_standby_5'); +$node_standby_5->init_from_backup($node_primary_2, $backup_name, + has_streaming => 1); +$node_standby_5->start; + +$node_primary_2->psql('postgres', qq[ + SELECT pg_switch_wal(); + INSERT INTO t VALUES (0); + CHECKPOINT; +]); +$node_primary_2->wait_for_catchup($node_standby_4, 'write', + $node_primary_2->lsn('insert')); +$node_primary_2->wait_for_catchup($node_standby_5, 'write', + $node_primary_2->lsn('insert')); + +# disconnect the second standby then connect to the promoted first standby +$node_standby_5->stop; +$node_standby_4->psql('postgres', 'SELECT pg_promote()'); +$node_standby_5->enable_streaming($node_standby_4); +$node_standby_5->append_conf('postgresql.conf', + "restore_command = 'cp $archdir/%f %p'"); + +# wipe-out pg_wal on the second standby (new standby) +my $pgwaldir = $node_standby_5->data_dir. "/pg_wal"; +opendir my $dh, $pgwaldir or die "failed to open $pgwaldir of node_standby_5"; +while (my $f = readdir($dh)) +{ + unlink("$pgwaldir/$f") if (-f "$pgwaldir/$f"); +} +closedir($dh); + +# restart the second standby, not just reloading to trigger archive recovery +$node_standby_5->start; +$node_standby_4->psql('postgres', qq[ + INSERT INTO t VALUES(0); + SELECT pg_switch_wal(); + INSERT INTO t VALUES(0); -- avoid segment border +]); + +$node_standby_4->wait_for_catchup($node_standby_5, 'write', + $node_standby_4->lsn('insert')); + +# check if the change is correctly replicated +my $result_3 = + $node_standby_4->safe_psql('postgres', + 'SELECT pg_current_wal_insert_lsn() = write_lsn FROM pg_stat_replication'); +is($result_3, 't', 'check receiving historic timeline from primary'); -- 2.27.0
At Fri, 14 May 2021 14:12:31 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > At Thu, 13 May 2021 17:07:31 -0400, Robert Haas <robertmhaas@gmail.com> wrote in > > On Mon, May 10, 2021 at 4:35 AM Kyotaro Horiguchi > > <horikyota.ntt@gmail.com> wrote: > > > It seems to me the issue here is not a race condition but > > > WaitForWALToBecomeAvailable initializing expectedTLEs with the history > > > of a improper timeline. So using recoveryTargetTLI instead of > > > receiveTLI for the case fixes this issue. > > > > I agree. > > > > > I believe the 004_timeline_switch.pl detects your issue. And the > > > attached change fixes it. > > > > So why does this use recoveryTargetTLI instead of receiveTLI only > > conditionally? Why not do it all the time? > > The commit ee994272ca apparently says that there's a case where primary This is not an incomplete line but just a garbage. > > The hard thing about this code is that the assumptions are not very > > clear. If we don't know why something is a certain way, then we might > > break things if we change it. Worse yet, if nobody else knows why it's > > like that either, then who knows what assumptions they might be > > making? It's hard to be sure that any change is safe. > > Thanks for the comment. > > > But that being said, we have a clear definition from the comments for > > what expectedTLEs is supposed to contain, and it's only going to end > > up with those contents if we initialize it from recoveryTargetTLI. So > > I am inclined to think that we ought to do that always, and if it > > Yes, I also found it after that, and agreed. Desynchronization > between recoveryTargetTLI and expectedTLEs prevents > rescanLatestTimeline from working. > > > breaks something, then that's a sign that some other part of the code > > also needs fixing, because apparently that hypothetical other part of > > the code doesn't work if expctedTLEs contains what the comments say > > that it should. > > After some more inspection, I'm now also sure that it is a > typo/thinko. Other than while fetching the first checkpoint, > receivedTLI is always in the history of recoveryTargetTLI, otherwise > recoveryTargetTLI is equal to receiveTLI. > > I read that the commit message as "waiting for fetching possible > future history files to know if there's the future for the current > timeline. However now I read it as "don't bother expecting for > possiblly-unavailable history files when we're reading the first > checkpoint the timeline for which is already known to us.". If it is > correct we don't bother considering future history files coming from > primary there. > > > Now maybe that's the wrong idea. But if so, then we're saying that the > > definition of expectedTLEs needs to be changed, and we should update > > the comments with the new definition, whatever it is. A lot of the > > confusion here results from the fact that the code and comments are > > inconsistent and we can't tell whether that's intentional or > > inadvertent. Let's not leave the next person who looks at this code > > wondering the same thing about whatever changes we make. > > Ok. The reason why we haven't have a complain about that would be > that it is rare that pg_wal is wiped out before a standby connects to > a just-promoted primary. I'm not sure about the tool Dilip is using, > though.. > > So the result is the attached. This would be back-patcheable to 9.3 > (or 9.6?) but I doubt that we should do as we don't seem to have had a > complaint on this issue and we're not full faith on this. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Fri, May 14, 2021 at 12:59 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > I am not sure that have you noticed the commit id which changed the > definition of expectedTLEs, Heikki has committed that change so adding > him in the list to know his opinion. I did notice, but keep in mind that this was more than 8 years ago. Even if Heikki is reading this thread, he may not remember why he changed 1 line of code one way rather than another in 2013. I mean if he does that's great, but it's asking a lot. > ===== > ee994272ca50f70b53074f0febaec97e28f83c4e > Author: Heikki Linnakangas <heikki.linnakangas@iki.fi> 2013-01-03 14:11:58 > Committer: Heikki Linnakangas <heikki.linnakangas@iki.fi> 2013-01-03 14:11:58 > > Delay reading timeline history file until it's fetched from master. > ..... > Without the timeline history file, recovering that file > will fail as the older timeline ID is not recognized to be an ancestor of > the target timeline. If you try to recover from such a backup, using only > streaming replication to fetch the WAL, this patch is required for that to > work. > ===== > > Part of this commit message says that it will not identify the > recoveryTargetTLI as the ancestor of the checkpoint timeline (without > history file). I did not understand what it is trying to say. Does > it is trying to say that even though the recoveryTargetTLI is the > ancestor of the checkpoint timeline but we can not track that because > we don't have a history file? So to handle this problem change the > definition of expectedTLEs to directly point to the checkpoint > timeline? > > Because before this commit, we were directly initializing expectedTLEs > with the history file of recoveryTargetTLI, we were not even waiting > for reading the checkpoint, but under this commit, it is changed. Well, I think that is talking about what the commit did in general, not specifically the one line of code that we think may be incorrect. As I understand it, the general issue here was that if XLogFileReadAnyTLI() was called before expectedTLEs got set, then prior to this commit it would have to fail, because the foreach() loop in that function would be iterating over an empty list. Heikki tried to make it not fail in that case, by setting tles = readTimeLineHistory(recoveryTargetTLI), so that the foreach loop *wouldn't* get an empty list. Thinking about this a bit more, I think the idea behind the logic this commit added to XLogFileReadAnyTLI() is that XLogFileReadAnyTLI(recoveryTargetTLI) may or may not produce the correct answer. If the timeline history file exists, it will contain all the information that we need and will return a complete list of TLEs. But if the file does not exist yet, then it will return a 1-entry list saying that the TLI in question has no parents. If readTimeLineHistory() actually reads the file, then it's safe to save the return value in expectedTLEs, but if it doesn't, then it may or may not be safe. If XLogFileReadAnyTLI calls XLogFileRead and it works, then the WAL segment we need exists on our target timeline and we don't actually need the timeline history for anything because we can just directly begin replay from the target timeline. But if XLogFileRead fails with the 1-entry dummy list, then we need the timeline history and don't have it yet, so we have to retry later, when the history file will hopefully be present, and then at that point readTimeLineHistory will return a different and better answer and hopefully it will all work. I think this is what the commit message is talking about when it says that "Without the timeline history file, recovering that file will fail as the older timeline ID is not recognized to be an ancestor of the target timeline." Without the timeline history file, we can't know whether some other timeline is an ancestor or not. But the specific way that manifests is that XLogFileReadAnyTLI() returns a 1-entry dummy list instead of the real contents of the timeline history file. This commit doesn't prevent that from happening, but it does prevent the 1-entry dummy list from getting stored in the global variable expectedTLEs, except in the case where no timeline switch is occurring and the lack of history therefore doesn't matter. Without this commit, if the call to readTimeLineHistory(recoveryTargetTLI) happens at a time when the timeline history file is not yet available, the 1-entry dummy list ends up in the global variable and there's no way for it to ever be replaced with a real history even if the timeline history file shows up in the archive later. As I see it, the question on the table here is whether there's any justification for the fact that when the second switch in WaitForWALToBecomeAvailable takes the XLOG_FROM_ARCHIVE/XLOG_FROM_PG_WAL branch, it calls XLogFileReadAnyTLI which tries to read the history of recoveryTargetTLI, while when that same switch takes the XLOG_FROM_STREAM branch, it tries to read the history of receiveTLI. I tend to think it doesn't make sense. On general principle, archiving and streaming are supposed to work the same way, so the idea that they are getting the timeline from different places is inherently suspect. But also and more specifically, AFAICS receiveTLI always has to be the same TLI that we requested from the server, so we're always looking up our own current TLI here rather than the target TLI, which seems wrong to me, at least of this moment. :-) But that having been said, I still don't quite understand the conditions required to tickle this problem. I spent a long time poking at it today. It seems to me that it ought somehow to be possible to recreate the scenario without trying to reuse the old master as a standby, and also without even needing a WAL archive, but I couldn't figure out how to do it. I tried setting up a primary and a standby, and then making a backup from the standby, promoting it, and then starting what would have been a cascading standby from the backup. But that doesn't do it. The first mistake I made was creating the standbys with something like 'pg_basebackup -R', but that's not good enough because then they have WAL, so I added '-Xnone'. But then I realized that when a base backup ends, the primary writes an XLOG_SWITCH record, which means that when the standby is promoted, the promotion is not in the same WAL segment as the checkpoint from which the machine that would have been a cascading standby is trying to start. I worked around that by setting recovery_target='immediate' on the standby. With that change, I get a WAL file on the new timeline - 2 in this case - that looks like this: rmgr: XLOG len (rec/tot): 114/ 114, tx: 0, lsn: 0/19000060, prev 0/19000028, desc: CHECKPOINT_ONLINE redo 0/19000028; tli 1; prev tli 1; fpw true; xid 0:587; oid 16385; multi 1; offset 0; oldest xid 579 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 587; online rmgr: XLOG len (rec/tot): 34/ 34, tx: 0, lsn: 0/190000D8, prev 0/19000060, desc: BACKUP_END 0/19000028 rmgr: XLOG len (rec/tot): 114/ 114, tx: 0, lsn: 0/19000100, prev 0/190000D8, desc: CHECKPOINT_SHUTDOWN redo 0/19000100; tli 2; prev tli 1; fpw true; xid 0:587; oid 16385; multi 1; offset 0; oldest xid 579 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown That sure looks like the right thing to recreate the problem, because the first checkpoint is from the backup, and the woulda-been-a-cascading-standby should be starting there, and the second checkpoint is in the same segment and shows a timeline switch. But everything worked great: 2021-05-14 17:44:58.684 EDT [5697] DETAIL: End of WAL reached on timeline 1 at 0/19000100. 2021-05-14 17:44:58.728 EDT [5694] LOG: new target timeline is 2 2021-05-14 17:44:58.746 EDT [5694] LOG: redo starts at 0/19000028 2021-05-14 17:44:58.749 EDT [5694] LOG: consistent recovery state reached at 0/19000100 I don't understand why that works. It feels to me like it ought to run smack into the same problem you saw, but it doesn't. > I am referring to the below code which was deleted by this commit: > > ======== > @@ -5279,49 +5299,6 @@ StartupXLOG(void) > */ > readRecoveryCommandFile(); > > - /* Now we can determine the list of expected TLIs */ > - expectedTLEs = readTimeLineHistory(recoveryTargetTLI); > - > - /* > - * If the location of the checkpoint record is not on the expected > - * timeline in the history of the requested timeline, we cannot proceed: > - * the backup is not part of the history of the requested timeline. > - */ > - if (tliOfPointInHistory(ControlFile->checkPoint, expectedTLEs) != > - ControlFile->checkPointCopy.ThisTimeLineID) > - { > ========= I don't think this code is really deleted. The tliOfPointInHistory check was just moved later in the function. And expectedTLEs is still supposed to be getting initialized, because just before the new location of the tliOfPointInHistory check, Heikki added Assert(expectedTLEs). -- Robert Haas EDB: http://www.enterprisedb.com
On Sat, May 15, 2021 at 3:58 AM Robert Haas <robertmhaas@gmail.com> wrote: > > I did notice, but keep in mind that this was more than 8 years ago. > Even if Heikki is reading this thread, he may not remember why he > changed 1 line of code one way rather than another in 2013. I mean if > he does that's great, but it's asking a lot. I agree with your point. But I think that one line is related to the purpose of this commit and I have explained (in 3rd paragraph below) why do I think so. As I understand it, the general issue here was that if > XLogFileReadAnyTLI() was called before expectedTLEs got set, then > prior to this commit it would have to fail, because the foreach() loop > in that function would be iterating over an empty list. Heikki tried > to make it not fail in that case, by setting tles = > readTimeLineHistory(recoveryTargetTLI), so that the foreach loop > *wouldn't* get an empty list. I might be missing something but I don't agree with this logic. If you see prior to this commit the code flow was like below[1]. So my point is if we are calling XLogFileReadAnyTLI() somewhere while reading the checkpoint record then note that expectedTLEs were initialized unconditionally before even try to read that checkpoint record. So how expectedTLEs could be uninitialized in LogFileReadAnyTLI? [1] StartupXLOG(void) { .... recoveryTargetTLI = ControlFile->checkPointCopy.ThisTimeLineID; ... readRecoveryCommandFile(); ... expectedTLEs = readTimeLineHistory(recoveryTargetTLI); ... .. record = ReadCheckpointRecord(checkPointLoc, 0); Another point which I am not sure about but still I think that one line (expectedTLEs = readTimeLineHistory(receiveTLI);), somewhere related to the purpose of this commit. Let me explain why do I think so. Basically, before this commit, we were initializing "expectedTLEs" based on the history file of "recoveryTargetTLI", right after calling "readRecoveryCommandFile()" (this function will initialize recoveryTargetTLI based on the recovery target, and it ensures it read the respective history file). Now, right after this point, there was a check as shown below[2], which is checking whether the checkpoint TLI exists in the "expectedTLEs" which is initialized based on recoveryTargetTLI. And it appeared that this check was failing in some cases which this commit tried to fix and all other code is there to support that. Because now before going for reading the checkpoint we are not initializing "expectedTLEs" so now after moving this line from here it was possible that "expectedTLEs" is not initialized in XLogFileReadAnyTLI() and the remaining code in XLogFileReadAnyTLI() is to handle that part. Now, coming to my point that why this one line is related, In this one line (expectedTLEs = readTimeLineHistory(receiveTLI);), we completely avoiding recoveryTargetTLI and initializing "expectedTLEs" based on the history file of the TL from which we read the checkpoint, so now, there is no scope of below[2] check to fail because note that we are not initializing "expectedTLEs" based on the "recoveryTargetTLI" but we are initializing from the history from where we read checkpoint. So I feel if we directly fix this one line to initialize "expectedTLEs" from "recoveryTargetTLI" then it will expose to the same problem this commit tried to fix. [2] if (tliOfPointInHistory(ControlFile->checkPoint, expectedTLEs) != ControlFile->checkPointCopy.ThisTimeLineID) { error() } -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
At Sat, 15 May 2021 10:55:05 +0530, Dilip Kumar <dilipbalaut@gmail.com> wrote in > On Sat, May 15, 2021 at 3:58 AM Robert Haas <robertmhaas@gmail.com> wrote: > > > > I did notice, but keep in mind that this was more than 8 years ago. > > Even if Heikki is reading this thread, he may not remember why he > > changed 1 line of code one way rather than another in 2013. I mean if > > he does that's great, but it's asking a lot. > > I agree with your point. But I think that one line is related to the > purpose of this commit and I have explained (in 3rd paragraph below) > why do I think so. > > As I understand it, the general issue here was that if > > XLogFileReadAnyTLI() was called before expectedTLEs got set, then > > prior to this commit it would have to fail, because the foreach() loop > > in that function would be iterating over an empty list. Heikki tried > > to make it not fail in that case, by setting tles = > > readTimeLineHistory(recoveryTargetTLI), so that the foreach loop > > *wouldn't* get an empty list. > > I might be missing something but I don't agree with this logic. If > you see prior to this commit the code flow was like below[1]. So my > point is if we are calling XLogFileReadAnyTLI() somewhere while > reading the checkpoint record then note that expectedTLEs were > initialized unconditionally before even try to read that checkpoint > record. So how expectedTLEs could be uninitialized in > LogFileReadAnyTLI? Mmm. I think both of you are right. Before the commit, XLogFileReadAnyTLI expected that expectedTLEs is initialized. After the commit it cannot no longer expect that so readTimeLineHistory was changed to try fetching by itself. *If* an appropriate history file is found, it *initializes* expectedTLEs with the content. > [1] > StartupXLOG(void) > { > .... > > recoveryTargetTLI = ControlFile->checkPointCopy.ThisTimeLineID; > ... > readRecoveryCommandFile(); > ... > expectedTLEs = readTimeLineHistory(recoveryTargetTLI); > ... > .. > record = ReadCheckpointRecord(checkPointLoc, 0); > > > Another point which I am not sure about but still I think that one > line (expectedTLEs = readTimeLineHistory(receiveTLI);), somewhere > related to the purpose of this commit. Let me explain why do I think > so. Basically, before this commit, we were initializing > "expectedTLEs" based on the history file of "recoveryTargetTLI", right > after calling "readRecoveryCommandFile()" (this function will > initialize recoveryTargetTLI based on the recovery target, and it > ensures it read the respective history file). Now, right after this > point, there was a check as shown below[2], which is checking whether > the checkpoint TLI exists in the "expectedTLEs" which is initialized > based on recoveryTargetTLI. And it appeared that this check was > failing in some cases which this commit tried to fix and all other > code is there to support that. Because now before going for reading > the checkpoint we are not initializing "expectedTLEs" so now after > moving this line from here it was possible that "expectedTLEs" is not > initialized in XLogFileReadAnyTLI() and the remaining code in > XLogFileReadAnyTLI() is to handle that part. Before the commit expectedTLEs is always initialized with just one entry for the TLI of the last checkpoint record. (1) If XLogFileReadAnyTLI() found the segment but no history file found, that is, using the dummy TLE-list, expectedTLEs is initialized with the dummy one-entry list. So there's no behavioral change in this aspect. (2) If we didn't find the segment for the checkpoint record, it starts replication and fetches history files and WAL records then revisits XLogFileReadAnyTLI. Now we have both the history file and segments, it successfully reads the recood. The difference of expectedTLEs made by the patch is having just one entry or the all entries for the past. Assuming that we keep expectedTLEs synced with recoveryTargetTLI, rescanLatestTimeLine updates the list properly so no need to worry about the future. So the issue would be in the past timelines. After reading the checkpoint record, if we need to rewind to the previous timeline for the REDO point, the dummy list is inconvenient. So there is a possibility that the patch fixed the case (2), where the standby doesn't have both the segment for the checkpoint record and the history file for the checkpoint, and the REDO point is on the last TLI. If it is correct, the patch still fails for the case (1), that is, the issue raised here. Anyway it would be useless (and rahter harmful) to initialize expectedTLEs based on receiveTLI there. So my resul there is: The commit fixed the case (2) The fix caused the issue for the case (1). The proposed fix fixes the case (1), caused by the commit. There's another issue in the case (1) and REDO point is back to the previous timeline, which is in doubt we need to fix.. > Now, coming to my point that why this one line is related, In this > one line (expectedTLEs = readTimeLineHistory(receiveTLI);), we > completely avoiding recoveryTargetTLI and initializing "expectedTLEs" > based on the history file of the TL from which we read the checkpoint, > so now, there is no scope of below[2] check to fail because note that > we are not initializing "expectedTLEs" based on the > "recoveryTargetTLI" but we are initializing from the history from > where we read checkpoint. > > So I feel if we directly fix this one line to initialize > "expectedTLEs" from "recoveryTargetTLI" then it will expose to the > same problem this commit tried to fix. > > [2] > if (tliOfPointInHistory(ControlFile->checkPoint, expectedTLEs) != > ControlFile->checkPointCopy.ThisTimeLineID) > { > error() > } regards. -- Kyotaro Horiguchi NTT Open Source Software Center
At Mon, 17 May 2021 12:20:12 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > Assuming that we keep expectedTLEs synced with recoveryTargetTLI, > rescanLatestTimeLine updates the list properly so no need to worry > about the future. So the issue would be in the past timelines. After > reading the checkpoint record, if we need to rewind to the previous > timeline for the REDO point, the dummy list is inconvenient. By the way, I tried reproducing this situation, but ended in finding it a kind of impossible because pg_basebackup (or pg_stop_backup()) waits for the promotion checkpoint to end. If we make a backup in a somewhat broken steps, that could be done but I didn't try. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
At Mon, 17 May 2021 13:01:04 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > At Mon, 17 May 2021 12:20:12 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > > Assuming that we keep expectedTLEs synced with recoveryTargetTLI, > > rescanLatestTimeLine updates the list properly so no need to worry > > about the future. So the issue would be in the past timelines. After > > reading the checkpoint record, if we need to rewind to the previous > > timeline for the REDO point, the dummy list is inconvenient. > > By the way, I tried reproducing this situation, but ended in finding > it a kind of impossible because pg_basebackup (or pg_stop_backup()) > waits for the promotion checkpoint to end. Mmm. That's wrong. What the tool waits is not a promotion checkpoint, but a backup-checkpoint, maybe. (I don't remember cleary about that, sorry.) > If we make a backup in a somewhat broken steps, that could be done but > I didn't try. So there might still be a way to reproduce that. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Mon, May 17, 2021 at 8:50 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > Before the commit expectedTLEs is always initialized with just one > entry for the TLI of the last checkpoint record. Right > (1) If XLogFileReadAnyTLI() found the segment but no history file > found, that is, using the dummy TLE-list, expectedTLEs is initialized > with the dummy one-entry list. So there's no behavioral change in this > aspect. Yeah, you are right. > (2) If we didn't find the segment for the checkpoint record, it starts > replication and fetches history files and WAL records then revisits > XLogFileReadAnyTLI. Now we have both the history file and segments, > it successfully reads the recood. The difference of expectedTLEs made > by the patch is having just one entry or the all entries for the past. Correct. > Assuming that we keep expectedTLEs synced with recoveryTargetTLI, > rescanLatestTimeLine updates the list properly so no need to worry > about the future. So the issue would be in the past timelines. After > reading the checkpoint record, if we need to rewind to the previous > timeline for the REDO point, the dummy list is inconvenient. > > So there is a possibility that the patch fixed the case (2), where the > standby doesn't have both the segment for the checkpoint record and > the history file for the checkpoint, and the REDO point is on the last > TLI. If it is correct, the patch still fails for the case (1), that > is, the issue raised here. Anyway it would be useless (and rahter > harmful) to initialize expectedTLEs based on receiveTLI there. > > So my resul there is: > > The commit fixed the case (2) Yes, by maintaining the entire history instead of one entry if history was missing. > The fix caused the issue for the case (1). Basically, before this commit expectedTLEs and recoveryTargetTLI were in always in sync which this patch broke. > The proposed fix fixes the case (1), caused by the commit. Right, I agree with the fix. So fix should be just to change that one line and initialize expectedTLEs from recoveryTargetTLI -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Mon, May 17, 2021 at 10:09 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, May 17, 2021 at 8:50 AM Kyotaro Horiguchi > <horikyota.ntt@gmail.com> wrote: > > > > Before the commit expectedTLEs is always initialized with just one > > entry for the TLI of the last checkpoint record. > > Right > > > (1) If XLogFileReadAnyTLI() found the segment but no history file > > found, that is, using the dummy TLE-list, expectedTLEs is initialized > > with the dummy one-entry list. So there's no behavioral change in this > > aspect. > > Yeah, you are right. But do you agree that one line entry will always be a checkpoint timeline entry? Because if you notice below code[1] in function "readRecoveryCommandFile();", then you will realize that once we come out of this function either the "recoveryTargetTLI" is checkpoint TL wherever it was before calling this function or we must have the history file. That means after exiting this function if we execute this line (expectedTLEs = readTimeLineHistory(recoveryTargetTLI);) that means either "expectedTLEs" could point to one dummy entry which will be nothing but the checkpoint TL entry or it will be holding complete history. The patch is trying to say that without the history file the checkpoint TL will not be found in "expectedTLEs" because the older TL (checkpoint TL) is not the ancestor of the target timeline(recoveryTargetTLI). But ideally, either the target timeline should be the same as the checkpoint timeline or we must have the history file as I stated in the above paragraph. Am I missing something? [1] if (rtli) { /* Timeline 1 does not have a history file, all else should */ if (rtli != 1 && !existsTimeLineHistory(rtli)) ereport(FATAL, (errmsg("recovery target timeline %u does not exist", rtli))); recoveryTargetTLI = rtli; recoveryTargetIsLatest = false; } else { /* We start the "latest" search from pg_control's timeline */ recoveryTargetTLI = findNewestTimeLine(recoveryTargetTLI); recoveryTargetIsLatest = true; } -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Sat, May 15, 2021 at 1:25 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > As I understand it, the general issue here was that if > > XLogFileReadAnyTLI() was called before expectedTLEs got set, then > > prior to this commit it would have to fail, because the foreach() loop > > in that function would be iterating over an empty list. Heikki tried > > to make it not fail in that case, by setting tles = > > readTimeLineHistory(recoveryTargetTLI), so that the foreach loop > > *wouldn't* get an empty list. > > I might be missing something but I don't agree with this logic. If > you see prior to this commit the code flow was like below[1]. So my > point is if we are calling XLogFileReadAnyTLI() somewhere while > reading the checkpoint record then note that expectedTLEs were > initialized unconditionally before even try to read that checkpoint > record. So how expectedTLEs could be uninitialized in > LogFileReadAnyTLI? Sorry, you're right. It couldn't be uninitialized, but it could be a fake 1-element list saying there are no ancestors rather than the real value. So I think the point was to avoid that. > Another point which I am not sure about but still I think that one > line (expectedTLEs = readTimeLineHistory(receiveTLI);), somewhere > related to the purpose of this commit. Let me explain why do I think > so. Basically, before this commit, we were initializing > "expectedTLEs" based on the history file of "recoveryTargetTLI", right > after calling "readRecoveryCommandFile()" (this function will > initialize recoveryTargetTLI based on the recovery target, and it > ensures it read the respective history file). Now, right after this > point, there was a check as shown below[2], which is checking whether > the checkpoint TLI exists in the "expectedTLEs" which is initialized > based on recoveryTargetTLI. And it appeared that this check was > failing in some cases which this commit tried to fix and all other > code is there to support that. Because now before going for reading > the checkpoint we are not initializing "expectedTLEs" so now after > moving this line from here it was possible that "expectedTLEs" is not > initialized in XLogFileReadAnyTLI() and the remaining code in > XLogFileReadAnyTLI() is to handle that part. I think the issue here is: If expectedTLEs was initialized before the history file was available, and contained a dummy 1-element list, then tliOfPointInHistory() is going to say that every LSN is on that timeline rather than any previous timeline. And if we are supposed to be switching timelines then that will lead to this sanity check failing. > Now, coming to my point that why this one line is related, In this > one line (expectedTLEs = readTimeLineHistory(receiveTLI);), we > completely avoiding recoveryTargetTLI and initializing "expectedTLEs" > based on the history file of the TL from which we read the checkpoint, > so now, there is no scope of below[2] check to fail because note that > we are not initializing "expectedTLEs" based on the > "recoveryTargetTLI" but we are initializing from the history from > where we read checkpoint. I agree, but that's actually bad, isn't it? I mean if we want the sanity check to never fail we can just take it out. What we want to happen is that the sanity check should pass if the startup timeline if the TLI of the startup checkpoint is in the history of the recovery target timeline, but fail if it isn't. The only way to achieve that behavior is if expectedTLEs is initialized from the recovery target timeline. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, May 18, 2021 at 1:28 AM Robert Haas <robertmhaas@gmail.com> wrote: > > Sorry, you're right. It couldn't be uninitialized, but it could be a > fake 1-element list saying there are no ancestors rather than the real > value. So I think the point was to avoid that. Yeah, it will be a fake 1-element list. But just to be clear that 1-element can only be "ControlFile->checkPointCopy.ThisTimeLineID" and nothing else, do you agree to this? Because we initialize recoveryTargetTLI to this value and we might change it in readRecoveryCommandFile() but for that, we must get the history file, so if we are talking about the case where we don't have the history file then "recoveryTargetTLI" will still be "ControlFile->checkPointCopy.ThisTimeLineID". > > I think the issue here is: If expectedTLEs was initialized before the > history file was available, and contained a dummy 1-element list, then > tliOfPointInHistory() is going to say that every LSN is on that > timeline rather than any previous timeline. And if we are supposed to > be switching timelines then that will lead to this sanity check > failing. You are talking about the sanity check of validating the timeline of the checkpoint record right? but as I mentioned earlier the only entry in expectedTLEs will be the TLE of the checkpoint record so how the sanity check will fail? > > I agree, but that's actually bad, isn't it? Yes, it is bad. I mean if we want the > sanity check to never fail we can just take it out. What we want to > happen is that the sanity check should pass if the startup timeline if > the TLI of the startup checkpoint is in the history of the recovery > target timeline, but fail if it isn't. The only way to achieve that > behavior is if expectedTLEs is initialized from the recovery target > timeline. Yes, I agree, with this. So initializing expectedTLEs with the recovery target timeline is the right fix. Conclusion: - I think now we agree on the point that initializing expectedTLEs with the recovery target timeline is the right fix. - We still have some differences of opinion about what was the original problem in the base code which was fixed by the commit (ee994272ca50f70b53074f0febaec97e28f83c4e). -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
At Mon, 17 May 2021 10:46:24 +0530, Dilip Kumar <dilipbalaut@gmail.com> wrote in > On Mon, May 17, 2021 at 10:09 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Mon, May 17, 2021 at 8:50 AM Kyotaro Horiguchi > > <horikyota.ntt@gmail.com> wrote: > > > > > > Before the commit expectedTLEs is always initialized with just one > > > entry for the TLI of the last checkpoint record. > > > > Right > > > > > (1) If XLogFileReadAnyTLI() found the segment but no history file > > > found, that is, using the dummy TLE-list, expectedTLEs is initialized > > > with the dummy one-entry list. So there's no behavioral change in this > > > aspect. > > > > Yeah, you are right. > > But do you agree that one line entry will always be a checkpoint > timeline entry? Because if you notice below code[1] in function > "readRecoveryCommandFile();", then you will realize that once we come > out of this function either the "recoveryTargetTLI" is checkpoint TL > wherever it was before calling this function or we must have the > history file. That means after exiting this function if we execute > this line (expectedTLEs = readTimeLineHistory(recoveryTargetTLI);) > that means either "expectedTLEs" could point to one dummy entry which > will be nothing but the checkpoint TL entry or it will be holding > complete history. Right. > The patch is trying to say that without the history file the > checkpoint TL will not be found in "expectedTLEs" because the older TL > (checkpoint TL) is not the ancestor of the target > timeline(recoveryTargetTLI). But ideally, either the target timeline > should be the same as the checkpoint timeline or we must have the > history file as I stated in the above paragraph. Am I missing > something? Yeah, that has been the most mysterious point here. So I searched for a situation the one-entry expectedTLEs does not work. I vaguely believed that there's a case where REDO point of a checkpoint is in the timeline previous to the record of the checkpoint. The previous discussion is based on this case, but that doesn't seem to happen. The last replayed checkpoint (that causes a restartpoint) record is found before protmotion and the first checkpoint starts after promotion. A little while ago I tried to make a situation where a checkpoint record is placed in the previous timeline of the tli written in the control file. But control file is always written after checkpoint record is flushed. I rebooted myself from this: ee994272ca: > There is at least one scenario where this makes a difference: if you take > a base backup from a standby server right after a timeline switch, the > WAL segment containing the initial checkpoint record will begin with an > older timeline ID. Without the timeline history file, recovering that file And finally I think I could reach the situation the commit wanted to fix. I took a basebackup from a standby just before replaying the first checkpoint of the new timeline (by using debugger), without copying pg_wal. In this backup, the control file contains checkPointCopy of the previous timeline. I modified StartXLOG so that expectedTLEs is set just after first determining recoveryTargetTLI, then started the grandchild node. I have the following error and the server fails to continue replication. [postmaster] LOG: starting PostgreSQL 14beta1 on x86_64-pc-linux-gnu... [startup] LOG: database system was interrupted while in recovery at log... [startup] LOG: set expectedtles tli=6, length=1 [startup] LOG: Probing history file for TLI=7 [startup] LOG: entering standby mode [startup] LOG: scanning segment 3 TLI 6, source 0 [startup] LOG: Trying fetching history file for TLI=6 [walreceiver] LOG: fetching timeline history file for timeline 5 from pri... [walreceiver] LOG: fetching timeline history file for timeline 6 from pri... [walreceiver] LOG: started streaming ... primary at 0/3000000 on timeline 5 [walreceiver] DETAIL: End of WAL reached on timeline 5 at 0/30006E0. [startup] LOG: unexpected timeline ID 1 in log segment 000000050000000000000003, offset 0 [startup] LOG: Probing history file for TLI=7 [startup] LOG: scanning segment 3 TLI 6, source 0 (repeats forever) This seems like the behavior the patch wanted to fix. (I'm not sure precisely what happened at the time of the "unexpected timeline ID 1..", though. The line is seen only just after the first conection.) > will fail as the older timeline ID is not recognized to be an ancestor of > the target timeline. If you try to recover from such a backup, using only > streaming replication to fetch the WAL, this patch is required for that to > work. After I reverted the modification, I got the following behavior instead from the same backup. [postmaster] LOG: starting PostgreSQL 14beta1 on x86_64-... [startup] JST LOG: database system was interrupted while in recovery at log time 2021-05-18 13:45:59 JST [startup] JST LOG: Probing history file for TLI=7 [startup] JST LOG: entering standby mode [startup] JST LOG: Loading history file for TLI=6 (2) [startup] JST LOG: Trying reading history file for TLI=6 [startup] JST LOG: scanning segment 3 TLI 6, source 0 [startup] JST LOG: Trying fetching history file for TLI=6 [walreceiver] JST LOG: fetching timeline history file for timeline 5 fro... [walreceiver] JST LOG: fetching timeline history file for timeline 6 fro... [walreceiver] JST LOG: started streaming ... primary at 0/3000000 on timeline 5 [walreceiver] JST LOG: replication terminated by primary server [walreceiver] JST DETAIL: End of WAL reached on timeline 5 at 0/30006E0. [startup] LOG: Loading expectedTLEs for 5 [startup] LOG: Trying reading history file for TLI=5 [startup] LOG: Checkpoint record: TLI=5, 0/3000668, rectargetTLI=6, exptles=0x3322a60 [startup] FATAL: requested timeline 6 does not contain minimum recovery point 0/30007C0 on timeline 6 [postmaster] LOG: startup process (PID 76526) exited with exit code 1 [postmaster] LOG: aborting startup due to startup process failure [postmaster] LOG: database system is shut down Aborts.. Yeah, this is the same issue with what is railed here. So, still I'm not sure I confirmed the case exactly (since the problem is still seen.. but I don't want to bother building the version.)... Anyway reading history file for recoveryTargetTLI instead of receiveTLI fixes that. FWIW, you could be get a problematic base backup by the following steps. 0. (make sure /tmp/hoge is removed) 1. apply the attached patch 2. create a primary then start 3. create a standby then start 4. place standby.signal to the primary, then restart it. 5. place the file /tmp/hoge. 6. promote the "primary". 7. You will see a log line like this LOG: WAIT START: CHECKPOINT_ONLINE: TLI=2 8. Take a base backup (without copying WAL files) > [1] > if (rtli) > { > /* Timeline 1 does not have a history file, all else should */ > if (rtli != 1 && !existsTimeLineHistory(rtli)) > ereport(FATAL, > (errmsg("recovery target timeline %u does not exist", > rtli))); > recoveryTargetTLI = rtli; > recoveryTargetIsLatest = false; > } > else > { > /* We start the "latest" search from pg_control's timeline */ > recoveryTargetTLI = findNewestTimeLine(recoveryTargetTLI); > recoveryTargetIsLatest = true; > } regards. -- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c index 8d0903c175..9483fd055c 100644 --- a/src/backend/access/transam/timeline.c +++ b/src/backend/access/transam/timeline.c @@ -55,6 +55,7 @@ restoreTimeLineHistoryFiles(TimeLineID begin, TimeLineID end) for (tli = begin; tli < end; tli++) { + elog(LOG, "Trying restoring history file for TLI=%d", tli); if (tli == 1) continue; @@ -95,6 +96,7 @@ readTimeLineHistory(TimeLineID targetTLI) if (ArchiveRecoveryRequested) { + elog(LOG, "Trying reading history file for TLI=%d", targetTLI); TLHistoryFileName(histfname, targetTLI); fromArchive = RestoreArchivedFile(path, histfname, "RECOVERYHISTORY", 0, false); @@ -231,6 +233,7 @@ existsTimeLineHistory(TimeLineID probeTLI) if (ArchiveRecoveryRequested) { + elog(LOG, "Probing history file for TLI=%d", probeTLI); TLHistoryFileName(histfname, probeTLI); RestoreArchivedFile(path, histfname, "RECOVERYHISTORY", 0, false); } diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 8d163f190f..afd6a0ce0a 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -3726,7 +3726,7 @@ XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli, snprintf(activitymsg, sizeof(activitymsg), "waiting for %s", xlogfname); set_ps_display(activitymsg); - + elog(LOG, "Trying fetching history file for TLI=%d", tli); restoredFromArchive = RestoreArchivedFile(path, xlogfname, "RECOVERYXLOG", wal_segment_size, @@ -3820,7 +3820,10 @@ XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source) if (expectedTLEs) tles = expectedTLEs; else + { + elog(LOG, "Loading history file for TLI=%d (2)", recoveryTargetTLI); tles = readTimeLineHistory(recoveryTargetTLI); + } foreach(cell, tles) { @@ -3834,6 +3837,8 @@ XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source) * Skip scanning the timeline ID that the logfile segment to read * doesn't belong to */ + elog(LOG, "scanning segment %lX TLI %d, source %d", segno, tli, source); + if (hent->begin != InvalidXLogRecPtr) { XLogSegNo beginseg = 0; @@ -3860,6 +3865,7 @@ XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source) XLOG_FROM_ARCHIVE, true); if (fd != -1) { + elog(LOG, "found segment %lX TLI %d, from archive", segno, tli); elog(DEBUG1, "got WAL segment from archive"); if (!expectedTLEs) expectedTLEs = tles; @@ -3873,6 +3879,7 @@ XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source) XLOG_FROM_PG_WAL, true); if (fd != -1) { + elog(LOG, "found segment %lX TLI %d, from PG_WAL", segno, tli); if (!expectedTLEs) expectedTLEs = tles; return fd; @@ -6577,6 +6584,8 @@ StartupXLOG(void) else recoveryTargetTLI = ControlFile->checkPointCopy.ThisTimeLineID; + expectedTLEs = readTimeLineHistory(recoveryTargetTLI); + elog(LOG, "set expectedtles %d, %d", recoveryTargetTLI, list_length(expectedTLEs)); /* * Check for signal files, and if so set up state for offline recovery */ @@ -6866,11 +6875,19 @@ StartupXLOG(void) if (!XLogRecPtrIsInvalid(ControlFile->minRecoveryPoint) && tliOfPointInHistory(ControlFile->minRecoveryPoint - 1, expectedTLEs) != ControlFile->minRecoveryPointTLI) + { + ListCell *lc; + foreach (lc, expectedTLEs) + { + TimeLineHistoryEntry *tle = (TimeLineHistoryEntry *) lfirst(lc); + elog(LOG, "TLE %d {%X/%X - %X/%X}", tle->tli, LSN_FORMAT_ARGS(tle->begin), LSN_FORMAT_ARGS(tle->end)); + } ereport(FATAL, (errmsg("requested timeline %u does not contain minimum recovery point %X/%X on timeline %u", recoveryTargetTLI, LSN_FORMAT_ARGS(ControlFile->minRecoveryPoint), ControlFile->minRecoveryPointTLI))); + } LastRec = RecPtr = checkPointLoc; @@ -8396,7 +8413,7 @@ ReadCheckpointRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, XLogBeginRead(xlogreader, RecPtr); record = ReadRecord(xlogreader, LOG, true); - + elog(LOG, "Checkpoint record: TLI=%d, %X/%X, rectargetTLI=%d, exptles=%p", xlogreader->seg.ws_tli, LSN_FORMAT_ARGS(xlogreader->ReadRecPtr),recoveryTargetTLI, expectedTLEs); if (record == NULL) { if (!report) @@ -10211,6 +10228,19 @@ xlog_redo(XLogReaderState *record) CheckPoint checkPoint; memcpy(&checkPoint, XLogRecGetData(record), sizeof(CheckPoint)); + { + struct stat b; + bool f = true; + while (stat("/tmp/hoge", &b) == 0) + { + if (f) + elog(LOG, "WAIT START: CHECKPOINT_ONLINE: TLI=%d", checkPoint.ThisTimeLineID); + f = false; + sleep(1); + } + if (!f) + elog(LOG, "WAIT END: CHECKPOINT_ONLINE"); + } /* In an ONLINE checkpoint, treat the XID counter as a minimum */ LWLockAcquire(XidGenLock, LW_EXCLUSIVE); if (FullTransactionIdPrecedes(ShmemVariableCache->nextXid, @@ -12595,7 +12625,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess, * TLI, rather than the position we're reading. */ tli = tliOfPointInHistory(tliRecPtr, expectedTLEs); - + elog(LOG, "%X/%X is on TLI %X", LSN_FORMAT_ARGS(tliRecPtr), tli); if (curFileTLI > 0 && tli < curFileTLI) elog(ERROR, "according to history file, WAL location %X/%X belongs to timeline %u, but previousrecovered WAL file came from timeline %u", LSN_FORMAT_ARGS(tliRecPtr), @@ -12662,7 +12692,11 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess, if (readFile < 0) { if (!expectedTLEs) - expectedTLEs = readTimeLineHistory(receiveTLI); + { + + elog(LOG, "Loading expectedTLEs for %d (%d)", recoveryTargetTLI, receiveTLI); + expectedTLEs = readTimeLineHistory(recoveryTargetTLI); + } readFile = XLogFileRead(readSegNo, PANIC, receiveTLI, XLOG_FROM_STREAM, false);
At Tue, 18 May 2021 15:52:07 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > FWIW, you could be get a problematic base backup by the following steps. > > 0. (make sure /tmp/hoge is removed) > 1. apply the attached patch > 2. create a primary then start > 3. create a standby then start > 4. place standby.signal to the primary, then restart it. > 5. place the file /tmp/hoge. > 6. promote the "primary". > 7. You will see a log line like this > LOG: WAIT START: CHECKPOINT_ONLINE: TLI=2 > 8. Take a base backup (without copying WAL files) I carelessly have left the "modification" uncommented in the diff file. @@ -6577,6 +6584,8 @@ StartupXLOG(void) else recoveryTargetTLI = ControlFile->checkPointCopy.ThisTimeLineID; + expectedTLEs = readTimeLineHistory(recoveryTargetTLI); + elog(LOG, "set expectedtles %d, %d", recoveryTargetTLI, list_length(expectedTLEs)); Disabling the lines would show the result of the ancient fix. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Tue, May 18, 2021 at 12:22 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > And finally I think I could reach the situation the commit wanted to fix. > > I took a basebackup from a standby just before replaying the first > checkpoint of the new timeline (by using debugger), without copying > pg_wal. In this backup, the control file contains checkPointCopy of > the previous timeline. > > I modified StartXLOG so that expectedTLEs is set just after first > determining recoveryTargetTLI, then started the grandchild node. I > have the following error and the server fails to continue replication. > [postmaster] LOG: starting PostgreSQL 14beta1 on x86_64-pc-linux-gnu... > [startup] LOG: database system was interrupted while in recovery at log... > [startup] LOG: set expectedtles tli=6, length=1 > [startup] LOG: Probing history file for TLI=7 > [startup] LOG: entering standby mode > [startup] LOG: scanning segment 3 TLI 6, source 0 > [startup] LOG: Trying fetching history file for TLI=6 > [walreceiver] LOG: fetching timeline history file for timeline 5 from pri... > [walreceiver] LOG: fetching timeline history file for timeline 6 from pri... > [walreceiver] LOG: started streaming ... primary at 0/3000000 on timeline 5 > [walreceiver] DETAIL: End of WAL reached on timeline 5 at 0/30006E0. > [startup] LOG: unexpected timeline ID 1 in log segment 000000050000000000000003, offset 0 > [startup] LOG: Probing history file for TLI=7 > [startup] LOG: scanning segment 3 TLI 6, source 0 > (repeats forever) So IIUC, this logs shows that "ControlFile->checkPointCopy.ThisTimeLineID" is 6 but "ControlFile->checkPoint" record is on TL 5? I think if you had the old version of the code (before the commit) or below code [1], right after initializing expectedTLEs then you would have hit the FATAL the patch had fix. While debugging did you check what was the "ControlFile->checkPoint" LSN vs the first LSN of the first segment with TL6? expectedTLEs = readTimeLineHistory(recoveryTargetTLI); [1] if (tliOfPointInHistory(ControlFile->checkPoint, expectedTLEs) != ControlFile->checkPointCopy.ThisTimeLineID) { report(FATAL.. } -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Tue, May 18, 2021 at 1:33 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > Yeah, it will be a fake 1-element list. But just to be clear that > 1-element can only be "ControlFile->checkPointCopy.ThisTimeLineID" and > nothing else, do you agree to this? Because we initialize > recoveryTargetTLI to this value and we might change it in > readRecoveryCommandFile() but for that, we must get the history file, > so if we are talking about the case where we don't have the history > file then "recoveryTargetTLI" will still be > "ControlFile->checkPointCopy.ThisTimeLineID". I don't think your conclusion is correct. As I understand it, you're talking about the state before ee994272ca50f70b53074f0febaec97e28f83c4e, because as of now readRecoveryCommandFile() no longer exists in master. Before that commit, StartupXLOG did this: recoveryTargetTLI = ControlFile->checkPointCopy.ThisTimeLineID; readRecoveryCommandFile(); expectedTLEs = readTimeLineHistory(recoveryTargetTLI); Now, readRecoveryCommandFile() can change recoveryTargetTLI. Before doing so, it will call existsTimeLineHistory if recovery_target_timeline was set to an integer, and findNewestTimeLine if it was set to latest. In the first case, existsTimeLineHistory() calls RestoreArchivedFile(), but that restores it using a temporary name, and KeepFileRestoredFromArchive is not called, so we might have the timeline history in RECOVERYHISTORY but that's not the filename we're actually going to try to read from inside readTimeLineHistory(). In the second case, findNewestTimeLine() will call existsTimeLineHistory() which results in the same situation. So I think when readRecoveryCommandFile() returns expectedTLI can be different but the history file can be absent since it was only ever restored under a temporary name. > Conclusion: > - I think now we agree on the point that initializing expectedTLEs > with the recovery target timeline is the right fix. > - We still have some differences of opinion about what was the > original problem in the base code which was fixed by the commit > (ee994272ca50f70b53074f0febaec97e28f83c4e). I am also still concerned about whether we understand in exactly what cases the current logic doesn't work. We seem to more or less agree on the fix, but I don't think we really understand precisely what case we are fixing. -- Robert Haas EDB: http://www.enterprisedb.com
At Thu, 20 May 2021 13:49:10 -0400, Robert Haas <robertmhaas@gmail.com> wrote in > On Tue, May 18, 2021 at 1:33 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > Yeah, it will be a fake 1-element list. But just to be clear that > > 1-element can only be "ControlFile->checkPointCopy.ThisTimeLineID" and > > nothing else, do you agree to this? Because we initialize > > recoveryTargetTLI to this value and we might change it in > > readRecoveryCommandFile() but for that, we must get the history file, > > so if we are talking about the case where we don't have the history > > file then "recoveryTargetTLI" will still be > > "ControlFile->checkPointCopy.ThisTimeLineID". > > I don't think your conclusion is correct. As I understand it, you're > talking about the state before > ee994272ca50f70b53074f0febaec97e28f83c4e, because as of now > readRecoveryCommandFile() no longer exists in master. Before that > commit, StartupXLOG did this: > > recoveryTargetTLI = ControlFile->checkPointCopy.ThisTimeLineID; > readRecoveryCommandFile(); > expectedTLEs = readTimeLineHistory(recoveryTargetTLI); > > Now, readRecoveryCommandFile() can change recoveryTargetTLI. Before > doing so, it will call existsTimeLineHistory if > recovery_target_timeline was set to an integer, and findNewestTimeLine > if it was set to latest. In the first case, existsTimeLineHistory() > calls RestoreArchivedFile(), but that restores it using a temporary > name, and KeepFileRestoredFromArchive is not called, so we might have > the timeline history in RECOVERYHISTORY but that's not the filename > we're actually going to try to read from inside readTimeLineHistory(). > In the second case, findNewestTimeLine() will call > existsTimeLineHistory() which results in the same situation. So I > think when readRecoveryCommandFile() returns expectedTLI can be > different but the history file can be absent since it was only ever > restored under a temporary name. Anyway it seems that the commit tried to fix an issue happens without using WAL archive. https://www.postgresql.org/message-id/50E43C57.5050101%40vmware.com > That leaves one case not covered: If you take a backup with plain > "pg_basebackup" from a standby, without -X, and the first WAL segment > contains a timeline switch (ie. you take the backup right after a > failover), and you try to recover from it without a WAL archive, it > doesn't work. This is the original issue that started this thread, > except that I used "-x" in my original test case. The problem here is > that even though streaming replication will fetch the timeline history > file when it connects, at the very beginning of recovery, we expect that > we already have the timeline history file corresponding the initial > timeline available, either in pg_xlog or the WAL archive. By the time > streaming replication has connected and fetched the history file, we've > already initialized expectedTLEs to contain just the latest TLI. To fix > that, we should delay calling readTimeLineHistoryFile() until streaming > replication has connected and fetched the file. > If the first segment read by recovery contains a timeline switch, the first > pages have older timeline than segment timeline. However, if > exepectedTLEs contained only the segment timeline, we cannot know if > we can use the record. In that case the following error is issued. If expectedTLEs is initialized with the pseudo list, tliOfPointInHistory always return the recoveryTargetTLI regardless of the given LSN so the checking about timelines later doesn't work. And later ReadRecord is surprised to see a page of an unknown timeline. "unexpected timeline ID 1 in log segment" So the objective is to initialize expectedTLEs with the right content of the history file for the recoveryTargetTLI until ReadRecord fetches the first record. After the fix things are working as the following. - recoveryTargetTimeLine is initialized with ControlFile->checkPointCopy.ThisTimeLineID - readRecoveryCommandFile(): Move recoveryTargetTLI forward to the specified target timline if the history file for the timeline is found, or in the case of latest, move it forward up to the maximum timeline among the history files found in either pg_wal or archive. !!! Anyway recoveryTargetTLI won't goes back behind the checkpoint TLI. - ReadRecord...XLogFileReadAnyTLI Tries to load the history file for recoveryTargetTLI either from pg_wal or archive onto local TLE list, if the history file is not found, use a generateed list with one entry for the recoveryTargetTLI. (a) If the specified segment file for any timeline in the TLE list is found, expectedTLEs is initialized with the local list. No need to worry about expectedTLEs any longer. (b) If such a segment is *not* found, expectedTLEs is left NIL. Usually recoveryTargetTLI is equal to the last checkpoint TLI. (c) However, in the case where timeline switches happened in the segment and the recoveryTargetTLI has been increased, that is, the history file for the recoveryTargetTLI is found in pg_wal or archive, that is, the issue raised here, recoveryTargetTLI becomes the future timline of the checkpoint TLI. (d) The history file for the recoveryTargetTLI is *not* found but the segment file is found, expectedTLEs is initialized with the generated list, which doesn't contain past timelines. In this case, recoveryTargetTLI has not moved from the initial value of the checkpoint TLI. If the REDO point is before a timeline switch, the page causes FATAL in ReadRecord later. However, I think there cannot be a case where segment file is found before corresponding history file. (Except for TLI=1, which is no problem.) - WaitForWALToBecomeAvailable if we have had no segments for the last checkpoint, initiate streaming from the REDO point of the last checkpoint. We should have all history files until receiving segment data. after sufficient WAL data has been received, the only cases where expectedTLEs is still NIL are the (b) and (c) above. In the case of (b) recoveryTargetTLI == checkpoint TLI. In the case of (c) recoveryTargetTLI > checkpoint TLI. In this case we expecte that checkpint TLI is in the history of recoveryTargetTLI. Otherwise recovery failse. This case is similar to the case (a) but the relationship between recoveryTargetTLI and the checkpoint TLI is not confirmed yet. ReadRecord barks later if they are not compatible so there's not a serious problem but might be better checking the relation ship there. My first proposal performed mutual check between the two but we need to check only unidirectionally. if (readFile < 0) { if (!expectedTLEs) { expectedTLEs = readTimeLineHistory(receiveTLI); + if (!tliOfPointInHistory(receiveTLI, expectedTLEs)) + ereport(ERROR, "the received timeline %d is not found in the history file for timeline %d"); > > Conclusion: > > - I think now we agree on the point that initializing expectedTLEs > > with the recovery target timeline is the right fix. > > - We still have some differences of opinion about what was the > > original problem in the base code which was fixed by the commit > > (ee994272ca50f70b53074f0febaec97e28f83c4e). > > I am also still concerned about whether we understand in exactly what > cases the current logic doesn't work. We seem to more or less agree on > the fix, but I don't think we really understand precisely what case we > are fixing. Does the discussion above make sense? regards. -- Kyotaro Horiguchi NTT Open Source Software Center
At Fri, 21 May 2021 11:21:05 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > At Thu, 20 May 2021 13:49:10 -0400, Robert Haas <robertmhaas@gmail.com> wrote in > In the case of (c) recoveryTargetTLI > checkpoint TLI. In this case > we expecte that checkpint TLI is in the history of > recoveryTargetTLI. Otherwise recovery failse. This case is similar > to the case (a) but the relationship between recoveryTargetTLI and > the checkpoint TLI is not confirmed yet. ReadRecord barks later if > they are not compatible so there's not a serious problem but might > be better checking the relation ship there. My first proposal > performed mutual check between the two but we need to check only > unidirectionally. > > if (readFile < 0) > { > if (!expectedTLEs) > { > expectedTLEs = readTimeLineHistory(receiveTLI); > + if (!tliOfPointInHistory(receiveTLI, expectedTLEs)) > + ereport(ERROR, "the received timeline %d is not found in the history file for timeline %d"); > > > > > Conclusion: > > > - I think now we agree on the point that initializing expectedTLEs > > > with the recovery target timeline is the right fix. > > > - We still have some differences of opinion about what was the > > > original problem in the base code which was fixed by the commit > > > (ee994272ca50f70b53074f0febaec97e28f83c4e). > > > > I am also still concerned about whether we understand in exactly what > > cases the current logic doesn't work. We seem to more or less agree on > > the fix, but I don't think we really understand precisely what case we > > are fixing. > > Does the discussion above make sense? This is a revised version. regards. -- Kyotaro Horiguchi NTT Open Source Software Center From 6664808340494dffc775e21cfa1029d6dfb8452e Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Date: Fri, 21 May 2021 16:24:14 +0900 Subject: [PATCH v3] Set expectedTLEs correctly based on recoveryTargetTLI When a standby started streaming before it determines expectedTLEs, it is determined based on the timeline of the WAL segment just streamed from the primary. If the standby fetches the older timeline than recoveryTargetTLI for thestarting checkpoint, this behavior leads to setting expectedTLEs based on the older timeline than recoveryTargetTLI. It is inconsistent with the definition of the variable and prevents later calls to rescanLatestTimeLine from updating recoveryTargetTLI and expectedTLEs correctly. If that happens the standby stops following the upstream. This behavior has been introduced by commit ee994272ca but there seems not a reason for the behavior and looks like a typo. Since the relationship between the two timeline IDs is not verified until then, also add a check for the relationship for safety's sake. --- src/backend/access/transam/xlog.c | 35 ++++++++++---- src/test/recovery/t/004_timeline_switch.pl | 55 +++++++++++++++++++++- 2 files changed, 79 insertions(+), 11 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 441a9124cd..fda729c63f 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -12659,22 +12659,37 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess, * info is set correctly and XLogReceiptTime isn't * changed. */ - if (readFile < 0) - { - if (!expectedTLEs) - expectedTLEs = readTimeLineHistory(receiveTLI); - readFile = XLogFileRead(readSegNo, PANIC, - receiveTLI, - XLOG_FROM_STREAM, false); - Assert(readFile >= 0); - } - else + if (readFile >= 0) { /* just make sure source info is correct... */ readSource = XLOG_FROM_STREAM; XLogReceiptSource = XLOG_FROM_STREAM; return true; } + + readFile = XLogFileRead(readSegNo, PANIC, + receiveTLI, + XLOG_FROM_STREAM, false); + Assert(readFile >= 0); + + if (expectedTLEs) + break; + + expectedTLEs = readTimeLineHistory(recoveryTargetTLI); + + /* + * We haven't checked the relationship beween the + * receiveTLI and recoveryTargetTLI. Make sure that + * receiveTLI is in the history for + * recoveryTargetTLI. We don't need to do that + * hearafter since recoveryTargetTLI and expectedTLEs + * will be kept in sync. + */ + if (!tliOfPointInHistory(receiveTLI, expectedTLEs)) + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("hisotry file of timeline %d is incosistent with the current timeline %d", + recoveryTargetTLI, receiveTLI))); break; } diff --git a/src/test/recovery/t/004_timeline_switch.pl b/src/test/recovery/t/004_timeline_switch.pl index c101980e9e..357f3bf8fa 100644 --- a/src/test/recovery/t/004_timeline_switch.pl +++ b/src/test/recovery/t/004_timeline_switch.pl @@ -90,7 +90,14 @@ $node_primary_2->backup($backup_name); # Create standby node my $node_standby_3 = get_new_node('standby_3'); $node_standby_3->init_from_backup($node_primary_2, $backup_name, - has_streaming => 1); + has_streaming => 1, allows_streaming => 1, has_archiving => 1); +my $archivedir_standby_3 = $node_standby_3->archive_dir; +$node_standby_3->append_conf( + 'postgresql.conf', qq( +archive_mode = always +archive_command = 'cp "%p" "$archivedir_standby_3/%f"' +log_line_prefix = '%m [%p:%b] %q%a ' +)); # Restart primary node in standby mode and promote it, switching it # to a new timeline. @@ -109,3 +116,49 @@ $node_primary_2->wait_for_catchup($node_standby_3, 'replay', my $result_2 = $node_standby_3->safe_psql('postgres', "SELECT count(*) FROM tab_int"); is($result_2, qq(1), 'check content of standby 3'); + + +# Check that server can start from a backup took from a standby that +# contains timeline switch midst of the first segment. + +# setup a standby connects to standby_3 +my $node_standby_4 = get_new_node('standby_4'); +$node_standby_4->init_from_backup($node_primary_2, $backup_name, + has_streaming => 1); +$node_standby_4->start; + +$node_primary_2->psql('postgres', qq[ + CREATE TABLE t (a int); + SELECT pg_switch_wal(); + INSERT INTO t VALUES (0); + CHECKPOINT; +]); +$node_primary_2->wait_for_catchup($node_standby_3, 'write', + $node_primary_2->lsn('insert')); +$node_primary_2->wait_for_catchup($node_standby_4, 'write', + $node_primary_2->lsn('insert')); + +# promote standby_3 then connect standby_4 to standby_3. To enforce +# archive fetching, fluash pg_wal files of the new standby. +$node_standby_4->stop; +$node_standby_3->psql('postgres', 'SELECT pg_promote()'); +$node_standby_4->enable_streaming($node_standby_3); +$node_standby_4->append_conf( + 'postgresql.conf', + qq[restore_command = 'cp "$archivedir_standby_3/%f" "%p"']); + +# clean up pg_wal on the second standby +my $pgwaldir = $node_standby_4->data_dir. "/pg_wal"; +opendir my $dh, $pgwaldir or die "failed to open $pgwaldir"; +while (my $f = readdir($dh)) +{ + unlink("$pgwaldir/$f") if (-f "$pgwaldir/$f"); +} +closedir($dh); + +$node_standby_4->start; + +# check if replication is working +$node_standby_3->psql('postgres', 'INSERT INTO t VALUES(1)'); +$node_standby_3->wait_for_catchup($node_standby_4, 'write', + $node_standby_3->lsn('insert')); -- 2.27.0
On Thu, May 20, 2021 at 11:19 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, May 18, 2021 at 1:33 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > Yeah, it will be a fake 1-element list. But just to be clear that > > 1-element can only be "ControlFile->checkPointCopy.ThisTimeLineID" and > > nothing else, do you agree to this? Because we initialize > > recoveryTargetTLI to this value and we might change it in > > readRecoveryCommandFile() but for that, we must get the history file, > > so if we are talking about the case where we don't have the history > > file then "recoveryTargetTLI" will still be > > "ControlFile->checkPointCopy.ThisTimeLineID". > > I don't think your conclusion is correct. As I understand it, you're > talking about the state before > ee994272ca50f70b53074f0febaec97e28f83c4e, Right, I am talking about before this commit. because as of now > readRecoveryCommandFile() no longer exists in master. Before that > commit, StartupXLOG did this: > > recoveryTargetTLI = ControlFile->checkPointCopy.ThisTimeLineID; > readRecoveryCommandFile(); > expectedTLEs = readTimeLineHistory(recoveryTargetTLI); > > Now, readRecoveryCommandFile() can change recoveryTargetTLI. Before > doing so, it will call existsTimeLineHistory if > recovery_target_timeline was set to an integer, and findNewestTimeLine > if it was set to latest. In the first case, existsTimeLineHistory() > calls RestoreArchivedFile(), but that restores it using a temporary > name, and KeepFileRestoredFromArchive is not called, Yes, I agree with this. so we might have > the timeline history in RECOVERYHISTORY but that's not the filename > we're actually going to try to read from inside readTimeLineHistory(). > In the second case, findNewestTimeLine() will call > existsTimeLineHistory() which results in the same situation. So I > think when readRecoveryCommandFile() returns expectedTLI can be > different but the history file can be absent since it was only ever > restored under a temporary name. I agree that readTimeLineHistory() will not look for that filename, but it will also try to get the file using (RestoreArchivedFile(path, histfname, "RECOVERYHISTORY", 0, false)). So after we check the history file existence in existsTimeLineHistory(), if the file got removed from the archive (not sure how) then it is possible that now readTimeLineHistory() will not find that history file again. Am I missing something? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Fri, May 21, 2021 at 7:51 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > https://www.postgresql.org/message-id/50E43C57.5050101%40vmware.com > > > That leaves one case not covered: If you take a backup with plain > > "pg_basebackup" from a standby, without -X, and the first WAL segment > > contains a timeline switch (ie. you take the backup right after a > > failover), and you try to recover from it without a WAL archive, it > > doesn't work. This is the original issue that started this thread, > > except that I used "-x" in my original test case. The problem here is > > that even though streaming replication will fetch the timeline history > > file when it connects, at the very beginning of recovery, we expect that > > we already have the timeline history file corresponding the initial > > timeline available, either in pg_xlog or the WAL archive. By the time > > streaming replication has connected and fetched the history file, we've > > already initialized expectedTLEs to contain just the latest TLI. To fix > > that, we should delay calling readTimeLineHistoryFile() until streaming > > replication has connected and fetched the file. > > If the first segment read by recovery contains a timeline switch, the first > > pages have older timeline than segment timeline. However, if > > exepectedTLEs contained only the segment timeline, we cannot know if > > we can use the record. In that case the following error is issued. > > If expectedTLEs is initialized with the pseudo list, > tliOfPointInHistory always return the recoveryTargetTLI regardless of > the given LSN so the checking about timelines later doesn't work. And > later ReadRecord is surprised to see a page of an unknown timeline. From this whole discussion (on the thread given by you), IIUC the issue was that if the checkpoint LSN does not exist on the "ControlFile->checkPointCopy.ThisTimeLineID". If that is true then I agree that we will just initialize expectedTLE based on the online entry (ControlFile->checkPointCopy.ThisTimeLineID) and later we will fail to find the checkpoint record on this timeline because the checkpoint LSN is smaller than the start LSN of this timeline. Right? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Fri, May 21, 2021 at 10:39 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > so we might have > > the timeline history in RECOVERYHISTORY but that's not the filename > > we're actually going to try to read from inside readTimeLineHistory(). > > In the second case, findNewestTimeLine() will call > > existsTimeLineHistory() which results in the same situation. So I > > think when readRecoveryCommandFile() returns expectedTLI can be > > different but the history file can be absent since it was only ever > > restored under a temporary name. > > I agree that readTimeLineHistory() will not look for that filename, > but it will also try to get the file using (RestoreArchivedFile(path, > histfname, "RECOVERYHISTORY", 0, false)). So after we check the > history file existence in existsTimeLineHistory(), if the file got > removed from the archive (not sure how) then it is possible that now > readTimeLineHistory() will not find that history file again. Am I > missing something? That sounds right. I've lost the thread of what we're talking about here a bit. I think what we've established is that, when running a commit prior to ee994272ca50f70b53074f0febaec97e28f83c4e, if (a) recovery_target_tli is set, (b) restore_command works, and (c) nothing's being removed from the archive concurrently, then by the time StartupXLOG() does expectedTLEs = readTimeLineHistory(recoveryTargetTLI), any timeline history file that exists in the archive will have been restored, and the scenario ee994272ca50f70b53074f0febaec97e28f83c4e was concerned about won't occur. That's because it was concerned about a scenario where we failed to restore the history file until after we set expectedTLEs. Consequently, if we want to try to reproduce the problem fixed by that commit, we should look for a scenario that does not involve setting recovery_target_tli. Is that the conclusion you were driving towards? -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, May 20, 2021 at 10:21 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > > Conclusion: > > > - I think now we agree on the point that initializing expectedTLEs > > > with the recovery target timeline is the right fix. > > > - We still have some differences of opinion about what was the > > > original problem in the base code which was fixed by the commit > > > (ee994272ca50f70b53074f0febaec97e28f83c4e). > > > > I am also still concerned about whether we understand in exactly what > > cases the current logic doesn't work. We seem to more or less agree on > > the fix, but I don't think we really understand precisely what case we > > are fixing. > > Does the discussion above make sense? I had trouble following it completely, but I didn't really spot anything that seemed definitely wrong. However, I don't understand what it has to do with where we are now. What I want to understand is: under exactly what circumstances does it matter that WaitForWALToBecomeAvailable(), when currentSource == XLOG_FROM_STREAM, will stream from receiveTLI rather than recoveryTargetTLI? -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, May 21, 2021 at 12:52 PM Robert Haas <robertmhaas@gmail.com> wrote: > I had trouble following it completely, but I didn't really spot > anything that seemed definitely wrong. However, I don't understand > what it has to do with where we are now. What I want to understand is: > under exactly what circumstances does it matter that > WaitForWALToBecomeAvailable(), when currentSource == XLOG_FROM_STREAM, > will stream from receiveTLI rather than recoveryTargetTLI? Ah ha! I think I figured it out. To hit this bug, you need to meet the following conditions: 1. Both streaming and archiving have to be configured. 2. You have to promote a new primary. 3. After promoting the new primary you have to start a new standby that doesn't have local WAL and for which the backup was taken from the previous timeline. In Dilip's original scenario, this new standby is actually the old primary, but that's not required. 4. The new standby has to be able to find the history file it needs in the archive but not the WAL files. 5. The new standby needs to have recovery_target_timeline='latest' (which is the default) When you start the new standby, it will fetch the current TLI from its control file. Then, since recovery_target_timeline=latest, the system will try to figure out the latest timeline, which only works because archiving is configured. There seems to be no provision for detecting the latest timeline via streaming. With archiving enabled, though, findNewestTimeLine() will be able to restore the history file created by the promotion of the new primary, which will cause validateRecoveryParameters() to change recoveryTargetTLI. Then we'll try to read the WAL segment containing the checkpoint record and fail because, by stipulation, only history files are available from the archive. Now, because streaming is also configured, we'll try streaming. That will work, so we'll be able to read the checkpoint record, but now, because WaitForWALToBecomeAvailable() initialized expectedTLEs using receiveTLI instead of recoveryTargetTLI, we can't switch to the correct timeline and it all goes wrong. The attached test script, test.sh seems to reliably reproduce this. Put that file and the recalcitrant_cp script, also attached, into an empty directory, cd to that directory, and run test.sh. Afterwards examine pgcascade.log. Basically, these scripts just set up the scenario described above. We set up primary and a standby that use recalcitrant_cp as the archive command, and because it's recalcitrant, it's only willing to copy history files, and always fails for WAL files.Then we create a cascading standby by taking a base backup from the standby, but before actually starting it, we promote the original standby. So now it meets all the conditions described above. I tried a couple variants of this test. If I switch the archive command from recalcitrant_cp to just regular cp, then there's no problem. And if I switch it to something that always fails, then there's also no problem. That's because, with either of those changes, condition (4) above is no longer met. In the first case, both files end up in the archive, and in the second case, neither file. What about hitting this in real life, with a real archive command? Well, you'd probably need the archive command to be kind of slow and get unlucky on the timing, but there's nothing to prevent it from happening. But, it will be WAY more likely if you have Dilip's original scenario, where you try to repurpose an old primary as a standby. It would normally be unlikely that the backup used to create a new standby would have an older TLI, because you typically wouldn't switch masters in between taking a base backup and using it to create a new standby. But the old master always has an older TLI. So (3) is satisfied. For (4) to be satisfied, you need the old master to fail to archive all of its WAL when it shuts down. -- Robert Haas EDB: http://www.enterprisedb.com
Вложения
On Sat, May 22, 2021 at 1:14 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Fri, May 21, 2021 at 12:52 PM Robert Haas <robertmhaas@gmail.com> wrote: > > I had trouble following it completely, but I didn't really spot > > anything that seemed definitely wrong. However, I don't understand > > what it has to do with where we are now. What I want to understand is: > > under exactly what circumstances does it matter that > > WaitForWALToBecomeAvailable(), when currentSource == XLOG_FROM_STREAM, > > will stream from receiveTLI rather than recoveryTargetTLI? > > Ah ha! I think I figured it out. To hit this bug, you need to meet the > following conditions: > > 1. Both streaming and archiving have to be configured. > 2. You have to promote a new primary. > 3. After promoting the new primary you have to start a new standby > that doesn't have local WAL and for which the backup was taken from > the previous timeline. In Dilip's original scenario, this new standby > is actually the old primary, but that's not required. No, in my original scenario also the new standby was not old primary, I had 3 nodes node1-> primary, node2 -> standby1, node3-> standby2 node2 promoted as a new primary and node3's local WAL was removed (so that it has to stream checkpoint record from new primary and then remaining everything happens as you explain in remaining steps). > 4. The new standby has to be able to find the history file it needs in > the archive but not the WAL files. > 5. The new standby needs to have recovery_target_timeline='latest' > (which is the default) > > When you start the new standby, it will fetch the current TLI from its > control file. Then, since recovery_target_timeline=latest, the system > will try to figure out the latest timeline, which only works because > archiving is configured. There seems to be no provision for detecting > the latest timeline via streaming. With archiving enabled, though, > findNewestTimeLine() will be able to restore the history file created > by the promotion of the new primary, which will cause > validateRecoveryParameters() to change recoveryTargetTLI. Right Then we'll > try to read the WAL segment containing the checkpoint record and fail > because, by stipulation, only history files are available from the > archive. Now, because streaming is also configured, we'll try > streaming. That will work, so we'll be able to read the checkpoint > record, but now, because WaitForWALToBecomeAvailable() initialized > expectedTLEs using receiveTLI instead of recoveryTargetTLI, we can't > switch to the correct timeline and it all goes wrong. exactly > The attached test script, test.sh seems to reliably reproduce this. > Put that file and the recalcitrant_cp script, also attached, into an > empty directory, cd to that directory, and run test.sh. Afterwards > examine pgcascade.log. Basically, these scripts just set up the > scenario described above. We set up primary and a standby that use > recalcitrant_cp as the archive command, and because it's recalcitrant, > it's only willing to copy history files, and always fails for WAL > files.Then we create a cascading standby by taking a base backup from > the standby, but before actually starting it, we promote the original > standby. So now it meets all the conditions described above. I tried a > couple variants of this test. If I switch the archive command from > recalcitrant_cp to just regular cp, then there's no problem. And if I > switch it to something that always fails, then there's also no > problem. That's because, with either of those changes, condition (4) > above is no longer met. In the first case, both files end up in the > archive, and in the second case, neither file. I haven't tested this, but I will do that. But now we are on the same page about the cause of the actual problem I reported. > What about hitting this in real life, with a real archive command? > Well, you'd probably need the archive command to be kind of slow and > get unlucky on the timing, but there's nothing to prevent it from > happening. Right > But, it will be WAY more likely if you have Dilip's original scenario, > where you try to repurpose an old primary as a standby. It would > normally be unlikely that the backup used to create a new standby > would have an older TLI, because you typically wouldn't switch masters > in between taking a base backup and using it to create a new standby. > But the old master always has an older TLI. So (3) is satisfied. For > (4) to be satisfied, you need the old master to fail to archive all of > its WAL when it shuts down. For my original case, both standby1 and standby2 are connected to the primary. Now, standby1 is promoted and standby2 is shut down. And, before restarting, all the local WAL of the standby2 is removed so that it can follow the new primary. The primary info and restore command for standby2 are changed as per the new primary(standby1). Now the scenario is that the standby1 has switched the timeline in the middle of the segment which contains the checkpoint record, so the segment with old TL is renamed to (.partial) and the same segment with new TL is not yet archived but the history file for the new TL has been archived. Now, when standby2 restart the remaining things happened as you explained, basically it restores the history file and changes the recoveryTargetTLI but it doesn't get the WAL file from the archive. So try to stream checkpoint record from the primary using "ControlFile->checkPointCopy.ThisTimeLineID", which is old timeline. Now, we may ask that if the WAL segment with old TL on standby1(new primary) which contains the checkpoint is already renamed to ".partial" then how can it stream using the old TL then the answer is below code[1] in the walsender. Basically, the checkpoint record is present in both new and old TL as TL switched in the middle of the segment, it will send you the record from the new TL even if the walreciever asks to stream with old TL. Now walrecievr is under impression that it has read from the old TL. And, we know the rest of the story that we will set the expectedTLEs based on the old history file and never be able to go to the new TL. Anyways now we understand the issue and there are many ways we can reproduce it. Still, I thought of explaining the exact steps how it happed for me because now we understand it well so I think it is easy to explain :) [1] WalSndSegmentOpen() { /*------- * When reading from a historic timeline, and there is a timeline switch * within this segment, read from the WAL segment belonging to the new * timeline. } -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Sat, May 22, 2021 at 10:15 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Sat, May 22, 2021 at 1:14 AM Robert Haas <robertmhaas@gmail.com> wrote: > > > > The attached test script, test.sh seems to reliably reproduce this. > > Put that file and the recalcitrant_cp script, also attached, into an > > I haven't tested this, but I will do that. But now we are on the same > page about the cause of the actual problem I reported. Now, I have tested. I am able to reproduce the actual problem with your script. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com On Sat, May 22, 2021 at 10:15 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Sat, May 22, 2021 at 1:14 AM Robert Haas <robertmhaas@gmail.com> wrote: > > > > On Fri, May 21, 2021 at 12:52 PM Robert Haas <robertmhaas@gmail.com> wrote: > > > I had trouble following it completely, but I didn't really spot > > > anything that seemed definitely wrong. However, I don't understand > > > what it has to do with where we are now. What I want to understand is: > > > under exactly what circumstances does it matter that > > > WaitForWALToBecomeAvailable(), when currentSource == XLOG_FROM_STREAM, > > > will stream from receiveTLI rather than recoveryTargetTLI? > > > > Ah ha! I think I figured it out. To hit this bug, you need to meet the > > following conditions: > > > > 1. Both streaming and archiving have to be configured. > > 2. You have to promote a new primary. > > 3. After promoting the new primary you have to start a new standby > > that doesn't have local WAL and for which the backup was taken from > > the previous timeline. In Dilip's original scenario, this new standby > > is actually the old primary, but that's not required. > > No, in my original scenario also the new standby was not old primary, > I had 3 nodes > node1-> primary, node2 -> standby1, node3-> standby2 > node2 promoted as a new primary and node3's local WAL was removed (so > that it has to stream checkpoint record from new primary and then > remaining everything happens as you explain in remaining steps). > > > 4. The new standby has to be able to find the history file it needs in > > the archive but not the WAL files. > > 5. The new standby needs to have recovery_target_timeline='latest' > > (which is the default) > > > > When you start the new standby, it will fetch the current TLI from its > > control file. Then, since recovery_target_timeline=latest, the system > > will try to figure out the latest timeline, which only works because > > archiving is configured. There seems to be no provision for detecting > > the latest timeline via streaming. With archiving enabled, though, > > findNewestTimeLine() will be able to restore the history file created > > by the promotion of the new primary, which will cause > > validateRecoveryParameters() to change recoveryTargetTLI. > > Right > > Then we'll > > try to read the WAL segment containing the checkpoint record and fail > > because, by stipulation, only history files are available from the > > archive. Now, because streaming is also configured, we'll try > > streaming. That will work, so we'll be able to read the checkpoint > > record, but now, because WaitForWALToBecomeAvailable() initialized > > expectedTLEs using receiveTLI instead of recoveryTargetTLI, we can't > > switch to the correct timeline and it all goes wrong. > > exactly > > > The attached test script, test.sh seems to reliably reproduce this. > > Put that file and the recalcitrant_cp script, also attached, into an > > empty directory, cd to that directory, and run test.sh. Afterwards > > examine pgcascade.log. Basically, these scripts just set up the > > scenario described above. We set up primary and a standby that use > > recalcitrant_cp as the archive command, and because it's recalcitrant, > > it's only willing to copy history files, and always fails for WAL > > files.Then we create a cascading standby by taking a base backup from > > the standby, but before actually starting it, we promote the original > > standby. So now it meets all the conditions described above. I tried a > > couple variants of this test. If I switch the archive command from > > recalcitrant_cp to just regular cp, then there's no problem. And if I > > switch it to something that always fails, then there's also no > > problem. That's because, with either of those changes, condition (4) > > above is no longer met. In the first case, both files end up in the > > archive, and in the second case, neither file. > > I haven't tested this, but I will do that. But now we are on the same > page about the cause of the actual problem I reported. > > > What about hitting this in real life, with a real archive command? > > Well, you'd probably need the archive command to be kind of slow and > > get unlucky on the timing, but there's nothing to prevent it from > > happening. > > Right > > > But, it will be WAY more likely if you have Dilip's original scenario, > > where you try to repurpose an old primary as a standby. It would > > normally be unlikely that the backup used to create a new standby > > would have an older TLI, because you typically wouldn't switch masters > > in between taking a base backup and using it to create a new standby. > > But the old master always has an older TLI. So (3) is satisfied. For > > (4) to be satisfied, you need the old master to fail to archive all of > > its WAL when it shuts down. > > For my original case, both standby1 and standby2 are connected to the > primary. Now, standby1 is promoted and standby2 is shut down. And, > before restarting, all the local WAL of the standby2 is removed so > that it can follow the new primary. The primary info and restore > command for standby2 are changed as per the new primary(standby1). > > Now the scenario is that the standby1 has switched the timeline in the > middle of the segment which contains the checkpoint record, so the > segment with old TL is renamed to (.partial) and the same segment with > new TL is not yet archived but the history file for the new TL has > been archived. > > Now, when standby2 restart the remaining things happened as you > explained, basically it restores the history file and changes the > recoveryTargetTLI but it doesn't get the WAL file from the archive. > So try to stream checkpoint record from the primary using > "ControlFile->checkPointCopy.ThisTimeLineID", which is old timeline. > > Now, we may ask that if the WAL segment with old TL on standby1(new > primary) which contains the checkpoint is already renamed to > ".partial" then how can it stream using the old TL then the answer is > below code[1] in the walsender. Basically, the checkpoint record is > present in both new and old TL as TL switched in the middle of the > segment, it will send you the record from the new TL even if the > walreciever asks to stream with old TL. Now walrecievr is under > impression that it has read from the old TL. And, we know the rest of > the story that we will set the expectedTLEs based on the old history > file and never be able to go to the new TL. > > Anyways now we understand the issue and there are many ways we can > reproduce it. Still, I thought of explaining the exact steps how it > happed for me because now we understand it well so I think it is easy > to explain :) > > [1] > WalSndSegmentOpen() > { > /*------- > * When reading from a historic timeline, and there is a timeline switch > * within this segment, read from the WAL segment belonging to the new > * timeline. > } > > -- > Regards, > Dilip Kumar > EnterpriseDB: http://www.enterprisedb.com -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Sat, May 22, 2021 at 12:45 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > No, in my original scenario also the new standby was not old primary, > I had 3 nodes > node1-> primary, node2 -> standby1, node3-> standby2 > node2 promoted as a new primary and node3's local WAL was removed (so > that it has to stream checkpoint record from new primary and then > remaining everything happens as you explain in remaining steps). Oh, OK. I misunderstood. I think it could happen that way, though. > I haven't tested this, but I will do that. But now we are on the same > page about the cause of the actual problem I reported. Yeah, sorry, I just didn't understand the exact chain of events before. > For my original case, both standby1 and standby2 are connected to the > primary. Now, standby1 is promoted and standby2 is shut down. And, > before restarting, all the local WAL of the standby2 is removed so > that it can follow the new primary. The primary info and restore > command for standby2 are changed as per the new primary(standby1). One thing I don't understand is why the final WAL segment from the original primary didn't end up in the archive in this scenario. If it had, then we would not have seen the issue in that case. -- Robert Haas EDB: http://www.enterprisedb.com
On Sat, May 22, 2021 at 8:33 PM Robert Haas <robertmhaas@gmail.com> wrote: > > For my original case, both standby1 and standby2 are connected to the > > primary. Now, standby1 is promoted and standby2 is shut down. And, > > before restarting, all the local WAL of the standby2 is removed so > > that it can follow the new primary. The primary info and restore > > command for standby2 are changed as per the new primary(standby1). > > One thing I don't understand is why the final WAL segment from the > original primary didn't end up in the archive in this scenario. If it > had, then we would not have seen the issue in that case. I used different archive folders for primary and new primary(standby1). I have modified your test.sh slightly (modified test2.sh attached) so that I can demonstrate my scenario where I was seeing the issue and this is getting fixed after putting the fix we discussed[1] [1] - expectedTLEs = readTimeLineHistory(receiveTLI); + expectedTLEs = readTimeLineHistory(recoveryTargetTLI); -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Вложения
On Sun, May 23, 2021 at 2:19 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Sat, May 22, 2021 at 8:33 PM Robert Haas <robertmhaas@gmail.com> wrote: I have created a tap test based on Robert's test.sh script. It reproduces the issue. I am new with perl so this still needs some cleanup/improvement, but at least it shows the idea. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Вложения
At Fri, 21 May 2021 12:52:54 -0400, Robert Haas <robertmhaas@gmail.com> wrote in > I had trouble following it completely, but I didn't really spot > anything that seemed definitely wrong. However, I don't understand > what it has to do with where we are now. What I want to understand is: > under exactly what circumstances does it matter that > WaitForWALToBecomeAvailable(), when currentSource == XLOG_FROM_STREAM, > will stream from receiveTLI rather than recoveryTargetTLI? Extracing related descriptions from my previous mail, - recoveryTargetTimeLine is initialized with ControlFile->checkPointCopy.ThisTimeLineID - readRecoveryCommandFile(): ...or in the case of latest, move it forward up to the maximum timeline among the history files found in either pg_wal or archive. - ReadRecord...XLogFileReadAnyTLI Tries to load the history file for recoveryTargetTLI either from pg_wal or archive onto local TLE list, if the history file is not found, use a generateed list with one entry for the recoveryTargetTLI. (b) If such a segment is *not* found, expectedTLEs is left NIL. Usually recoveryTargetTLI is equal to the last checkpoint TLI. (c) However, in the case where timeline switches happened in the segment and the recoveryTargetTLI has been increased, that is, the history file for the recoveryTargetTLI is found in pg_wal or archive, that is, the issue raised here, recoveryTargetTLI becomes the future timline of the checkpoint TLI. - WaitForWALToBecomeAvailable In the case of (c) recoveryTargetTLI > checkpoint TLI. In this case we expecte that checkpint TLI is in the history of recoveryTargetTLI. Otherwise recovery failse^h. This case is similar to the case (a) but the relationship between recoveryTargetTLI and the checkpoint TLI is not confirmed yet. ReadRecord barks later if they are not compatible so there's not a serious problem but might be better checking the relation ship there. My first proposal performed mutual check between the two but we need to check only unidirectionally. === So the condition for the Dilip's case is, as you wrote in another mail: - ControlFile->checkPointCopy.ThisTimeLineID is in the older timeline. - Archive or pg_wal offers the history file for the newer timeline. - The segment for the checkpoint is not found in pg_wal nor in archive. That is, - A grandchild(c) node is stopped - Then the child node(b) is promoted. - Clear pg_wal directory of (c) then connect it to (b) *before* (b) archives the segment for the newer timeline of the timeline-switching segments. (if we have switched at segment 3, TLI=1, the segment file of the older timeline is renamed to .partial, then create the same segment for TLI=2. The former is archived while promotion is performed but the latter won't be archive until the segment ends.) The orinal case of after the commit ee994272ca, - recoveryTargetTimeLine is initialized with ControlFile->checkPointCopy.ThisTimeLineID (X) (Before the commit, we created the one-entry expectedTLEs consists only of ControlFile->checkPointCopy.ThisTimeLineID.) - readRecoveryCommandFile(): Move recoveryTargetTLI forward to the specified target timline if the history file for the timeline is found, or in the case of latest, move it forward up to the maximum timeline among the history files found in either pg_wal or archive. - ReadRecord...XLogFileReadAnyTLI Tries to load the history file for recoveryTargetTLI either from pg_wal or archive onto local TLE list, if the history file is not found, use a generateed list with one entry for the recoveryTargetTLI. (b) If such a segment is *not* found, expectedTLEs is left NIL. Usually recoveryTargetTLI is equal to the last checkpoint TLI. - WaitForWALToBecomeAvailable if we have had no segments for the last checkpoint, initiate streaming from the REDO point of the last checkpoint. We should have all history files until receiving segment data. after sufficient WAL data has been received, the only cases where expectedTLEs is still NIL are the (b) and (c) above. In the case of (b) recoveryTargetTLI == checkpoint TLI. So I thought that the commit fixed this scenario. Even in this case, ReadRecord fails because the checkpoint segment contains pages for the older timeline which is not in expectedTLEs if we did (X). regards. -- Kyotaro Horiguchi NTT Open Source Software Center
At Sun, 23 May 2021 21:37:58 +0530, Dilip Kumar <dilipbalaut@gmail.com> wrote in > On Sun, May 23, 2021 at 2:19 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Sat, May 22, 2021 at 8:33 PM Robert Haas <robertmhaas@gmail.com> wrote: > > I have created a tap test based on Robert's test.sh script. It > reproduces the issue. I am new with perl so this still needs some > cleanup/improvement, but at least it shows the idea. I'm not sure I'm following the discussion here, however, if we were trying to reproduce Dilip's case using base backup, we would need such a broken archive command if using pg_basebackup witn -Xnone. Becuase the current version of pg_basebackup waits for all required WAL segments to be archived when connecting to a standby with -Xnone. I don't bother reconfirming the version that fix took place, but just using -X stream instead of "none" we successfully miss the first segment of the new timeline in the upstream archive, though we need to erase pg_wal in the backup. Either the broken archive command or erasing pg_wal of the cascade is required to the behavior to occur. The attached is how it looks like. -- Kyotaro Horiguchi NTT Open Source Software Center # Copyright (c) 2021, PostgreSQL Global Development Group # Minimal test testing streaming replication use Cwd; use strict; use warnings; use PostgresNode; use TestLib; use Test::More tests => 1; # Initialize primary node my $node_primary = get_new_node('primary'); # A specific role is created to perform some tests related to replication, # and it needs proper authentication configuration. $node_primary->init(allows_streaming => 1); $node_primary->append_conf( 'postgresql.conf', qq( wal_keep_size=128MB )); $node_primary->start; my $backup_name = 'my_backup'; # Take backup $node_primary->backup($backup_name); my $node_standby_1 = get_new_node('standby_1'); $node_standby_1->init_from_backup($node_primary, $backup_name, allows_streaming => 1, has_streaming => 1); my $archivedir_standby_1 = $node_standby_1->archive_dir; $node_standby_1->append_conf( 'postgresql.conf', qq( archive_mode=always archive_command='cp "%p" "$archivedir_standby_1/%f"' )); $node_standby_1->start; # Take backup of standby 1 # NB: Use -Xnone so that pg_wal is empty. #$node_standby_1->backup($backup_name, backup_options => ['-Xnone']); $node_standby_1->backup($backup_name); # Promote the standby. $node_standby_1->psql('postgres', 'SELECT pg_promote()'); # clean up pg_wal from the backup my $pgwaldir = $node_standby_1->backup_dir. "/" . $backup_name . "/pg_wal"; opendir my $dh, $pgwaldir or die "failed to open $pgwaldir"; while (my $f = readdir($dh)) { unlink("$pgwaldir/$f") if (-f "$pgwaldir/$f"); } closedir($dh); # Create cascading standby but don't start it yet. # NB: Must set up both streaming and archiving. my $node_cascade = get_new_node('cascade'); $node_cascade->init_from_backup($node_standby_1, $backup_name, has_streaming => 1); $node_cascade->append_conf( 'postgresql.conf', qq( restore_command = 'cp "$archivedir_standby_1/%f" "%p"' log_line_prefix = '%m [%p:%b] %q%a ' archive_mode=off )); # Start cascade node $node_cascade->start; # Create some content on primary and check its presence in standby 1 $node_standby_1->safe_psql('postgres', "CREATE TABLE tab_int AS SELECT 1 AS a"); # Wait for standbys to catch up $node_standby_1->wait_for_catchup($node_cascade, 'replay', $node_standby_1->lsn('replay')); ok(1, 'test'); # it's sucess if we come here.
On Mon, May 24, 2021 at 10:17 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > At Sun, 23 May 2021 21:37:58 +0530, Dilip Kumar <dilipbalaut@gmail.com> wrote in > > On Sun, May 23, 2021 at 2:19 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > On Sat, May 22, 2021 at 8:33 PM Robert Haas <robertmhaas@gmail.com> wrote: > > > > I have created a tap test based on Robert's test.sh script. It > > reproduces the issue. I am new with perl so this still needs some > > cleanup/improvement, but at least it shows the idea. > > I'm not sure I'm following the discussion here, however, if we were > trying to reproduce Dilip's case using base backup, we would need such > a broken archive command if using pg_basebackup witn -Xnone. Becuase > the current version of pg_basebackup waits for all required WAL > segments to be archived when connecting to a standby with -Xnone. Right, that's the reason if you see my patch I have dynamically generated such archive command which skips everything other than the history file see below snippet from my patch, where I am generating a skip_cp command and then I am using that as an archive command. == +# Prepare a alternative archive command to skip WAL files +my $script = "#!/usr/bin/perl \n +use File::Copy; \n +my (\$source, \$target) = \@ARGV; \n +if (\$source =~ /history/) \n +{ \n + copy(\$source, \$target); \n +}"; + +open my $fh, '>', "skip_cp"; +print {$fh} $script; === I > don't bother reconfirming the version that fix took place, but just > using -X stream instead of "none" we successfully miss the first > segment of the new timeline in the upstream archive, though we need to > erase pg_wal in the backup. Either the broken archive command or > erasing pg_wal of the cascade is required to the behavior to occur. > > The attached is how it looks like. I will test this and let you know. Thanks! -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Sun, May 23, 2021 at 12:08 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > I have created a tap test based on Robert's test.sh script. It > reproduces the issue. I am new with perl so this still needs some > cleanup/improvement, but at least it shows the idea. Thanks. I think this is the right idea but just needs a few adjustments. I don't think that dynamically writing out a file into the current working directory of the script is the right approach. Instead I think we should be planning to check this file into the repository and then have the test script find it. Now the trick is how to do that in a portable way. I think we can probably use the same idea that the pg_rewind tests use to find a perl module located in the test directory. That is: use FindBin; and then use $FindBin::RealBin to construct a path name to the executable, e.g. $node_primary->append_conf( 'postgresql.conf', qq( archive_command = '"$FindBin::RealBin/skip_cp" "%p" "$archivedir_primary/%f"' )); This avoids issues such as: leaving behind files if the script is terminated, needing the current working directory to be writable, possible permissions issues with the new file under Windows or SE-Linux. The restore_command needs to be "cp" on Linux but "copy" on Windows. Maybe you can use PostgresNode.pm's enable_restoring? Or if that doesn't work, then you need to mimic the logic, as src/test/recovery/t/020_archive_status.pl does for archive_command. Why do you set log_line_prefix? Is that needed? Why are the nodes called standby_1 and cascade? Either use standby and cascade or standby_1 and standby_2. There is a comment that says "Create some content on primary and check its presence in standby 1" but it only creates the content, and does not check anything. I think we don't really need to do any of this, but at least the code and the comment have to match. Let's not call the command skip_cp. It's not very descriptive. If you don't like recalcitrant_cp, then maybe something like cp_history_files or so. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, May 25, 2021 at 9:16 PM Robert Haas <robertmhaas@gmail.com> wrote: > use FindBin; > > and then use $FindBin::RealBin to construct a path name to the executable, e.g. > > $node_primary->append_conf( > 'postgresql.conf', qq( > archive_command = '"$FindBin::RealBin/skip_cp" "%p" "$archivedir_primary/%f"' > )); > > This avoids issues such as: leaving behind files if the script is > terminated, needing the current working directory to be writable, > possible permissions issues with the new file under Windows or > SE-Linux. Done > The restore_command needs to be "cp" on Linux but "copy" on Windows. > Maybe you can use PostgresNode.pm's enable_restoring? Or if that > doesn't work, then you need to mimic the logic, as > src/test/recovery/t/020_archive_status.pl does for archive_command. Done > Why do you set log_line_prefix? Is that needed? No, it was not, removed > Why are the nodes called standby_1 and cascade? Either use standby and > cascade or standby_1 and standby_2. Fixed > There is a comment that says "Create some content on primary and check > its presence in standby 1" but it only creates the content, and does > not check anything. I think we don't really need to do any of this, > but at least the code and the comment have to match. I think we need to create some content on promoted standby and check whether the cascade standby is able to get that or not, that will guarantee that it is actually following the promoted standby, I have added the test for that so that it matches the comments. > Let's not call the command skip_cp. It's not very descriptive. If you > don't like recalcitrant_cp, then maybe something like cp_history_files > or so. Done -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Вложения
On Wed, May 26, 2021 at 2:44 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > I think we need to create some content on promoted standby and check > whether the cascade standby is able to get that or not, that will > guarantee that it is actually following the promoted standby, I have > added the test for that so that it matches the comments. OK, but I ran this test against an unpatched server and it passed. That's not so great, because the test should fail without the bug fix. It seems to be because there's only one actual test in this test case. Looking at the log file, src/test/recovery/tmp_check/log/regress_log_025_timeline_issue, the only "ok" nor "not ok" line is: ok 1 - check streamed content on cascade standby So either that test is not right or some other test is needed. I think there's something else going wrong here, because when I run my original test.sh script, I see this: 2021-05-26 11:37:47.794 EDT [57961] LOG: restored log file "00000002.history" from archive ... 2021-05-26 11:37:47.916 EDT [57961] LOG: redo starts at 0/2000028 ... 2021-05-26 11:37:47.927 EDT [57966] LOG: replication terminated by primary server 2021-05-26 11:37:47.927 EDT [57966] DETAIL: End of WAL reached on timeline 1 at 0/3000000 But in the src/test/recovery/tmp_check/log/025_timeline_issue_cascade.log file generated by this test case: cp: /Users/rhaas/pgsql/src/test/recovery/tmp_check/t_025_timeline_issue_primary_data/archives/00000002.history: No such file or directory ... 2021-05-26 11:41:08.149 EDT [63347] LOG: fetching timeline history file for timeline 2 from primary server ... 2021-05-26 11:41:08.288 EDT [63344] LOG: new target timeline is 2 ... 2021-05-26 11:41:08.303 EDT [63344] LOG: redo starts at 0/2000028 ... 2021-05-26 11:41:08.331 EDT [63347] LOG: restarted WAL streaming at 0/3000000 on timeline 2 So it doesn't seem like the test is actually reproducing the problem correctly. The timeline history file isn't available from the archive, so it streams it, and then the problem doesn't occur. I guess that's because there's nothing to guarantee that the history file reaches the archive before 'cascade' is started. The code just does: # Promote the standby. $node_standby->psql('postgres', 'SELECT pg_promote()'); # Start cascade node $node_cascade->start; ...which has a clear race condition. src/test/recovery/t/023_pitr_prepared_xact.pl has logic to wait for a WAL file to be archived, so maybe we can steal that logic and use it here. I suggest we rename the test to something a bit more descriptive. Like instead of 025_timeline_issue.pl, perhaps 025_stuck_on_old_timeline.pl? Or I'm open to other suggestions, but "timeline issue" is a bit too vague for my taste. -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, May 26, 2021 at 9:40 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Wed, May 26, 2021 at 2:44 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > I think we need to create some content on promoted standby and check > > whether the cascade standby is able to get that or not, that will > > guarantee that it is actually following the promoted standby, I have > > added the test for that so that it matches the comments. > > OK, but I ran this test against an unpatched server and it passed. > That's not so great, because the test should fail without the bug fix. > It seems to be because there's only one actual test in this test case. > Looking at the log file, > src/test/recovery/tmp_check/log/regress_log_025_timeline_issue, the > only "ok" nor "not ok" line is: > > ok 1 - check streamed content on cascade standby > > So either that test is not right or some other test is needed. I think > there's something else going wrong here, because when I run my > original test.sh script, I see this: Thats strange, when I ran the test I can see below in log of cascade node (which shows that cascade get the history file but not the WAL file and then it select the old timeline and never go to the new timeline) ... 2021-05-26 21:46:54.412 IST [84080] LOG: restored log file "00000002.history" from archive cp: cannot stat ‘/home/dilipkumar/work/PG/postgresql/src/test/recovery/tmp_check/t_025_timeline_issue_primary_data/archives/00000003.history’: No such file or directory 2021-05-26 21:46:54.415 IST [84080] LOG: entering standby mode 2021-05-26 21:46:54.419 IST [84080] LOG: restored log file "00000002.history" from archive ..... 2021-05-26 21:46:54.429 IST [84085] LOG: started streaming WAL from primary at 0/2000000 on timeline 1 -> stream using previous TL 2021-05-26 21:46:54.466 IST [84080] LOG: redo starts at 0/2000028 2021-05-26 21:46:54.466 IST [84080] LOG: consistent recovery state reached at 0/3000000 2021-05-26 21:46:54.467 IST [84079] LOG: database system is ready to accept read only connections 2021-05-26 21:46:54.483 IST [84085] LOG: replication terminated by primary server 2021-05-26 21:46:54.483 IST [84085] DETAIL: End of WAL reached on timeline 1 at 0/3000000. cp: cannot stat ‘/home/dilipkumar/work/PG/postgresql/src/test/recovery/tmp_check/t_025_timeline_issue_primary_data/archives/00000003.history’: No such file or directory cp: cannot stat ‘/home/dilipkumar/work/PG/postgresql/src/test/recovery/tmp_check/t_025_timeline_issue_primary_data/archives/000000010000000000000003’: No such file or directory 2021-05-26 21:46:54.498 IST [84085] LOG: primary server contains no more WAL on requested timeline 1 <failure continues as it never go to timeline 2> And finally the test case fails because the cascade can never get the changes. I will check if there is any timing dependency in the test case. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Wed, May 26, 2021 at 12:26 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > I will check if there is any timing dependency in the test case. There is. I explained it in the second part of my email, which you may have failed to notice. -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, 26 May 2021 at 10:06 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, May 26, 2021 at 12:26 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I will check if there is any timing dependency in the test case.
There is. I explained it in the second part of my email, which you may
have failed to notice.
Sorry, my bad. I got your point now. I will change the test.
At Wed, 26 May 2021 22:08:32 +0530, Dilip Kumar <dilipbalaut@gmail.com> wrote in > On Wed, 26 May 2021 at 10:06 PM, Robert Haas <robertmhaas@gmail.com> wrote: > > > On Wed, May 26, 2021 at 12:26 PM Dilip Kumar <dilipbalaut@gmail.com> > > wrote: > > > I will check if there is any timing dependency in the test case. > > > > There is. I explained it in the second part of my email, which you may > > have failed to notice. > > > Sorry, my bad. I got your point now. I will change the test. I didn't noticed that but that is actually possible to happen. By the way I'm having a hard time understanding what was happening on this thread. In the very early in this thread I posted a test script that exactly reproduces Dilip's case by starting from two standbys based on his explanation. But *we* didn't understand what the original commit ee994272ca intended and I understood that we wanted to know it. So in the mail [1] and [2] I tried to describe what's going on around the two issues. Although I haven't have a response to [2], can I think that we clarified the intention of ee994272ca? And may I think that we decided that we don't add a test for the commit? Then it seems to me that Robert refound how to reproduce Dilip's case using basebackup instead of using two standbys. It is using a broken archive_command with pg_basebackup -Xnone and I showed that the same resulting state is available by pg_basebackup -Xstream/fetch clearing pg_wal directory of the resulting backup including an explanation of why. *I* think that it is better to avoid to have the archive_command since it seems to me that just unlinking some files seems simpler tha having the broken archive_command. However, since Robert ignored it, I guess that Robert thinks that the broken archive_command is better than that. It my understanding above about the current status of this thread is right? FWIW, regarding to the name of the test script, putting aside what it actually does, I proposed to place it as a part or 004_timeline_switch.pl because this issue is related to timeline switching. [1] 20210521.112105.27166595366072396.horikyota.ntt@gmail.com [2] https://www.postgresql.org/message-id/20210524.113402.1922481024406047229.horikyota.ntt@gmail.com regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Thu, May 27, 2021 at 6:19 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > At Wed, 26 May 2021 22:08:32 +0530, Dilip Kumar <dilipbalaut@gmail.com> wrote in > > On Wed, 26 May 2021 at 10:06 PM, Robert Haas <robertmhaas@gmail.com> wrote: > > > > > On Wed, May 26, 2021 at 12:26 PM Dilip Kumar <dilipbalaut@gmail.com> > > > wrote: > > > > I will check if there is any timing dependency in the test case. > > > > > > There is. I explained it in the second part of my email, which you may > > > have failed to notice. > > > > > > Sorry, my bad. I got your point now. I will change the test. > > I didn't noticed that but that is actually possible to happen. > > > By the way I'm having a hard time understanding what was happening on > this thread. > > In the very early in this thread I posted a test script that exactly > reproduces Dilip's case by starting from two standbys based on his > explanation. But *we* didn't understand what the original commit > ee994272ca intended and I understood that we wanted to know it. > > So in the mail [1] and [2] I tried to describe what's going on around > the two issues. Although I haven't have a response to [2], can I > think that we clarified the intention of ee994272ca? And may I think > that we decided that we don't add a test for the commit? > > Then it seems to me that Robert refound how to reproduce Dilip's case > using basebackup instead of using two standbys. It is using a broken > archive_command with pg_basebackup -Xnone and I showed that the same > resulting state is available by pg_basebackup -Xstream/fetch clearing > pg_wal directory of the resulting backup including an explanation of > why. Maybe we can somehow achieve that without a broken archive command, but I am not sure how it is enough to just delete WAL from pg_wal? I mean my original case was that 1. Got the new history file from the archive but did not get the WAL file yet which contains the checkpoint after TL switch 2. So the standby2 try to stream using new primary using old TL and set the wrong TL in expectedTLEs But if you are not doing anything to stop archiving WAL files or to guarantee that WAL has come to archive and you deleted those then I am not sure how we are reproducing the original problem. BTW, I have also tested your script and I found below log, which shows that standby2 is successfully able to select the timeline2 so it is not reproducing the issue. Am I missing something? --standby-1-- 2021-05-27 10:45:35.866 IST [5096] LOG: last completed transaction was at log time 2021-05-27 10:45:35.699316+05:30 2021-05-27 10:45:35.867 IST [5096] LOG: selected new timeline ID: 2 2021-05-27 10:45:35.882 IST [5096] LOG: archive recovery complete 2021-05-27 10:45:35.911 IST [5095] LOG: database system is ready to accept connections 2021-05-27 10:45:36.096 IST [5134] standby_2 LOG: received replication command: IDENTIFY_SYSTEM 2021-05-27 10:45:36.096 IST [5134] standby_2 STATEMENT: IDENTIFY_SYSTEM --standby-2-- 2021-05-27 10:45:36.089 IST [5129] LOG: entering standby mode 2021-05-27 10:45:36.090 IST [5129] LOG: redo starts at 0/2000028 2021-05-27 10:45:36.092 IST [5129] LOG: consistent recovery state reached at 0/3030320 2021-05-27 10:45:36.092 IST [5129] LOG: invalid record length at 0/3030320: wanted 24, got 0 2021-05-27 10:45:36.092 IST [5128] LOG: database system is ready to accept read only connections 2021-05-27 10:45:36.096 IST [5133] LOG: fetching timeline history file for timeline 2 from primary server 2021-05-27 10:45:36.097 IST [5133] LOG: started streaming WAL from primary at 0/3000000 on timeline 1 2021-05-27 10:45:36.098 IST [5133] LOG: replication terminated by primary server 2021-05-27 10:45:36.098 IST [5133] DETAIL: End of WAL reached on timeline 1 at 0/3030320. 2021-05-27 10:45:36.098 IST [5129] LOG: new target timeline is 2 2021-05-27 10:45:36.098 IST [5133] LOG: restarted WAL streaming at 0/3000000 on timeline 2 -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Wed, May 26, 2021 at 9:40 PM Robert Haas <robertmhaas@gmail.com> wrote: > ...which has a clear race condition. > src/test/recovery/t/023_pitr_prepared_xact.pl has logic to wait for a > WAL file to be archived, so maybe we can steal that logic and use it > here. Yeah, done that, I think we can use exact same logic for history files as well because if wal file is archived then history file must be because a) history file get created during promotion so created before WAL file with new TL is ready for archive b) Archiver archive history files before archiving any WAL files. src/test/recovery/t/025_stuck_on_old_timeline.pl > I suggest we rename the test to something a bit more descriptive. Like > instead of 025_timeline_issue.pl, perhaps > 025_stuck_on_old_timeline.pl? Or I'm open to other suggestions, but > "timeline issue" is a bit too vague for my taste. Changed as suggested. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Вложения
At Thu, 27 May 2021 11:44:47 +0530, Dilip Kumar <dilipbalaut@gmail.com> wrote in > Maybe we can somehow achieve that without a broken archive command, > but I am not sure how it is enough to just delete WAL from pg_wal? I > mean my original case was that > 1. Got the new history file from the archive but did not get the WAL > file yet which contains the checkpoint after TL switch > 2. So the standby2 try to stream using new primary using old TL and > set the wrong TL in expectedTLEs > > But if you are not doing anything to stop archiving WAL files or to > guarantee that WAL has come to archive and you deleted those then I am > not sure how we are reproducing the original problem. Thanks for the reply! We're writing at the very beginning of the switching segment at the promotion time. So it is guaranteed that the first segment of the newer timline won't be archived until the rest almost 16MB in the segment is consumed or someone explicitly causes a segment switch (including archive timeout). > BTW, I have also tested your script and I found below log, which shows > that standby2 is successfully able to select the timeline2 so it is > not reproducing the issue. Am I missing something? standby_2? My last one 026_timeline_issue_2.pl doesn't use that name and uses "standby_1 and "cascade". In the ealier ones, standby_4 and 5 (or 3 and 4 in the later versions) are used in ths additional tests. So I think it shold be something different? -- Kyotaro Horiguchi NTT Open Source Software Center
On Thu, May 27, 2021 at 12:09 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > At Thu, 27 May 2021 11:44:47 +0530, Dilip Kumar <dilipbalaut@gmail.com> wrote in > > Maybe we can somehow achieve that without a broken archive command, > > but I am not sure how it is enough to just delete WAL from pg_wal? I > > mean my original case was that > > 1. Got the new history file from the archive but did not get the WAL > > file yet which contains the checkpoint after TL switch > > 2. So the standby2 try to stream using new primary using old TL and > > set the wrong TL in expectedTLEs > > > > But if you are not doing anything to stop archiving WAL files or to > > guarantee that WAL has come to archive and you deleted those then I am > > not sure how we are reproducing the original problem. > > Thanks for the reply! > > We're writing at the very beginning of the switching segment at the > promotion time. So it is guaranteed that the first segment of the > newer timline won't be archived until the rest almost 16MB in the > segment is consumed or someone explicitly causes a segment switch > (including archive timeout). I agree > > BTW, I have also tested your script and I found below log, which shows > > that standby2 is successfully able to select the timeline2 so it is > > not reproducing the issue. Am I missing something? > > standby_2? My last one 026_timeline_issue_2.pl doesn't use that name > and uses "standby_1 and "cascade". In the ealier ones, standby_4 and > 5 (or 3 and 4 in the later versions) are used in ths additional tests. > > So I think it shold be something different? Yeah, I tested with your patch where you had a different test case, with "026_timeline_issue_2.pl", I am able to reproduce the issue. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
At Thu, 27 May 2021 12:47:30 +0530, Dilip Kumar <dilipbalaut@gmail.com> wrote in > On Thu, May 27, 2021 at 12:09 PM Kyotaro Horiguchi > <horikyota.ntt@gmail.com> wrote: > > > > At Thu, 27 May 2021 11:44:47 +0530, Dilip Kumar <dilipbalaut@gmail.com> wrote in > > We're writing at the very beginning of the switching segment at the > > promotion time. So it is guaranteed that the first segment of the > > newer timline won't be archived until the rest almost 16MB in the > > segment is consumed or someone explicitly causes a segment switch > > (including archive timeout). > > I agree > > > > BTW, I have also tested your script and I found below log, which shows > > > that standby2 is successfully able to select the timeline2 so it is > > > not reproducing the issue. Am I missing something? > > > > standby_2? My last one 026_timeline_issue_2.pl doesn't use that name > > and uses "standby_1 and "cascade". In the ealier ones, standby_4 and > > 5 (or 3 and 4 in the later versions) are used in ths additional tests. > > > > So I think it shold be something different? > > Yeah, I tested with your patch where you had a different test case, > with "026_timeline_issue_2.pl", I am able to reproduce the issue. That said, I don't object if we decide to choose the crafted archive command as far as we consider the trade-offs between the two ways. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Wed, May 26, 2021 at 8:49 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > So in the mail [1] and [2] I tried to describe what's going on around > the two issues. Although I haven't have a response to [2], can I > think that we clarified the intention of ee994272ca? And may I think > that we decided that we don't add a test for the commit? Regarding the first question, I feel that the intention of ee994272ca is fairly clear at this point. Someone else might feel differently so I won't presume to speak for anyone but me. Regarding the second question, I am not opposed to adding a test for that commit, but I think it is a lot more important to fix the bug we have now than to add a test for a bug that was fixed a long time ago. > Then it seems to me that Robert refound how to reproduce Dilip's case > using basebackup instead of using two standbys. It is using a broken > archive_command with pg_basebackup -Xnone and I showed that the same > resulting state is available by pg_basebackup -Xstream/fetch clearing > pg_wal directory of the resulting backup including an explanation of > why. Yes, it makes sense that we could get to the same state either by not fetching the WAL in the first place, or alternatively by fetching it and then removing it. > *I* think that it is better to avoid to have the archive_command since > it seems to me that just unlinking some files seems simpler tha having > the broken archive_command. However, since Robert ignored it, I guess > that Robert thinks that the broken archive_command is better than > that. Well ... I don't see those things as quite related. As far as I can see, unlinking files from pg_wal is an alternative to using -Xnone. On the other hand, the broken archive_command is there to make sure the new primary doesn't archive its WAL segment too soon. Regarding the first point, I think using -Xnone is better than using -Xfetch/stream and then removing the WAL, because (1) it doesn't seem efficient to fetch WAL only to turn around and remove it and (2) someone might question whether removing the WAL afterward is a supported procedure, whereas using an option built into the tool must surely be supported. Regarding the second point, I think using the broken archive command is superior because we can be sure of the behavior. If we just rely on not having crossed a segment boundary, then anything that causes more WAL to be generated than we are expecting could break the test. I don't think it's particularly likely in a case like this that autovacuum or any other thing would kick in and generate extra WAL, but the broken archive command ensures that even if it does happen, the test will still work as intended. That, to me, seems like a good enough reason to do it that way. > FWIW, regarding to the name of the test script, putting aside what it > actually does, I proposed to place it as a part or > 004_timeline_switch.pl because this issue is related to timeline > switching. I think it is better to keep it separate. Long test scripts that test multiple things with completely separate tests are hard to read. Kyotaro-san, I hope I have not given any offense. I am doing my best, and certainly did not mean to be rude. -- Robert Haas EDB: http://www.enterprisedb.com
Hi Horiguchi-san, In a project I helped with, I encountered an issue where the archive command kept failing. I thought this issue was related to the problem in this thread, so I'm sharing it here. If I should create a new thread, please let me know. * Problem - The archive_command is failed always. * Conditions under which the problem occurs (parameters) - archive_mode=always - Using the test command in archive_command * Probable cause - I guess that is because the .history file already exists, and the test command fails. (but if we use archive_mode=on, archive_command is successful). * How to reproduce - Attached is a script to reproduce the problem. Note: the script will remove $PGDATA when it started The test command is listed as an example of the use of archive_command in postgresql.conf, and the project faced this problem because it used the example as is. If this behavior is a specification, it would be better not to write the test command as a usage example. Or maybe there should be a note that the test command should not be used when archive_mode=always. Maybe, I'm missing something, sorry. Regards, Tatsuro Yamada
Вложения
Thanks! At Thu, 27 May 2021 15:05:44 -0400, Robert Haas <robertmhaas@gmail.com> wrote in > On Wed, May 26, 2021 at 8:49 PM Kyotaro Horiguchi > <horikyota.ntt@gmail.com> wrote: > > So in the mail [1] and [2] I tried to describe what's going on around > > the two issues. Although I haven't have a response to [2], can I > > think that we clarified the intention of ee994272ca? And may I think > > that we decided that we don't add a test for the commit? > > Regarding the first question, I feel that the intention of ee994272ca > is fairly clear at this point. Someone else might feel differently so > I won't presume to speak for anyone but me. I completely misunderstood your intention here. > Regarding the second question, I am not opposed to adding a test for > that commit, but I think it is a lot more important to fix the bug we > have now than to add a test for a bug that was fixed a long time ago. Yes. I agree to that. Glad to see that. > > Then it seems to me that Robert refound how to reproduce Dilip's case > > using basebackup instead of using two standbys. It is using a broken > > archive_command with pg_basebackup -Xnone and I showed that the same > > resulting state is available by pg_basebackup -Xstream/fetch clearing > > pg_wal directory of the resulting backup including an explanation of > > why. > > Yes, it makes sense that we could get to the same state either by not > fetching the WAL in the first place, or alternatively by fetching it > and then removing it. Sure. That is an opinion and I can agree to that. > > *I* think that it is better to avoid to have the archive_command since > > it seems to me that just unlinking some files seems simpler tha having > > the broken archive_command. However, since Robert ignored it, I guess > > that Robert thinks that the broken archive_command is better than > > that. > > Well ... I don't see those things as quite related. As far as I can > see, unlinking files from pg_wal is an alternative to using -Xnone. On > the other hand, the broken archive_command is there to make sure the > new primary doesn't archive its WAL segment too soon. I agree to use the archive_command just to create the desired state. > Regarding the first point, I think using -Xnone is better than using > -Xfetch/stream and then removing the WAL, because (1) it doesn't seem > efficient to fetch WAL only to turn around and remove it and (2) > someone might question whether removing the WAL afterward is a > supported procedure, whereas using an option built into the tool must > surely be supported. Mmmm. That looks like meaning that we don't intend to support the Dilip's case, and means that we support the use of archive-command-copies-only-other-than-wal-segments? > Regarding the second point, I think using the broken archive command > is superior because we can be sure of the behavior. If we just rely on > not having crossed a segment boundary, then anything that causes more > WAL to be generated than we are expecting could break the test. I > don't think it's particularly likely in a case like this that > autovacuum or any other thing would kick in and generate extra WAL, > but the broken archive command ensures that even if it does happen, > the test will still work as intended. That, to me, seems like a good > enough reason to do it that way. Yeah. It is what the most convincing reason. > > FWIW, regarding to the name of the test script, putting aside what it > > actually does, I proposed to place it as a part or > > 004_timeline_switch.pl because this issue is related to timeline > > switching. > > I think it is better to keep it separate. Long test scripts that test > multiple things with completely separate tests are hard to read. Agreed. I often annoyed by a long-lasting TAP script when I wanted to do one of the test items in it. However, I was not sure which is our policy here, consolidating all related tests into one script or having separate scripts containing tests up to a "certain" number or a set of tests that would take a certain time, or limiting by number the of lines. I thought that we are on the first way as I have told several times to put new tests into an existing script. > Kyotaro-san, I hope I have not given any offense. I am doing my best, > and certainly did not mean to be rude. No. Thanks for the words, Robert. I might be a bit too naive, but I had an anxious feeling that I might have been totally pointless or my words might have been too cryptic/broken (my fingers are quite fat), or I might have done something wrong or anything other. Anyway I thought I might have done something wrong here. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
(Sorry for being a bit off-topic) At Fri, 28 May 2021 12:18:35 +0900, Tatsuro Yamada <tatsuro.yamada.tf@nttcom.co.jp> wrote in > Hi Horiguchi-san, (Why me?) > In a project I helped with, I encountered an issue where > the archive command kept failing. I thought this issue was > related to the problem in this thread, so I'm sharing it here. > If I should create a new thread, please let me know. > > * Problem > - The archive_command is failed always. Although I think the configuration is a kind of broken, it can be seen as it is mimicing the case of shared-archive, where primary and standby share the same archive directory. Basically we need to use an archive command like the following for that case to avoid this kind of failure. The script returns "success" when the target file is found but identical with the source file. I don't find such a description in the documentation, and haven't bothered digging into the mailing-list archive. == #! /bin/bash if [ -f $2 ]; then cmp -s $1 $2 if [ $? != 0 ]; then exit 1 fi exit 0 fi cp $1 $2 == A maybe-non-optimal behavior is both 00000002.history.done and .ready files are found at once in archive_status directory but that doesn't practically matter. (Or I faintly remember that it is designed to work even in that case.) regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Hi Horiguchi-san, > (Why me?) Because the story was also related to PG-REX, which you are also involved in developing. Perhaps off-list instead of -hackers would have been better, but I emailed -hackers because the same problem could be encountered by PostgreSQL users who do not use PG-REX. >> In a project I helped with, I encountered an issue where >> the archive command kept failing. I thought this issue was >> related to the problem in this thread, so I'm sharing it here. >> If I should create a new thread, please let me know. >> >> * Problem >> - The archive_command is failed always. > > Although I think the configuration is a kind of broken, it can be seen > as it is mimicing the case of shared-archive, where primary and > standby share the same archive directory. To be precise, the environment of this reproduction script is different from our actual environment. I tried to make it as simple as possible to reproduce the problem. (In order to make it look like the actual environment, you have to build a PG-REX environment.) A simple replication environment might be enough, so I'll try to recreate a script that is closer to the actual environment later. > Basically we need to use an archive command like the following for > that case to avoid this kind of failure. The script returns "success" > when the target file is found but identical with the source file. I > don't find such a description in the documentation, and haven't > bothered digging into the mailing-list archive. > > == > #! /bin/bash > > if [ -f $2 ]; then > cmp -s $1 $2 > if [ $? != 0 ]; then > exit 1 > fi > exit 0 > fi > > cp $1 $2 > == Thanks for your reply. Since the above behavior is different from the behavior of the test command in the following example in postgresql.conf, I think we should write a note about this example. # e.g. 'test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f' Let me describe the problem we faced. - When archive_mode=always, archive_command is (sometimes) executed in a situation where the history file already exists on the standby side. - In this case, if "test ! -f" is written in the archive_command of postgresql.conf on the standby side, the command will keep failing. Note that this problem does not occur when archive_mode=on. So, what should we do for the user? I think we should put some notes in postgresql.conf or in the documentation. For example, something like this: ==== Note: If you use archive_mode=always, the archive_command on the standby side should not be used "test ! -f". ==== Regards, Tatsuro Yamada
So, I started a thread for this topic diverged from the following thread. https://www.postgresql.org/message-id/4698027d-5c0d-098f-9a8e-8cf09e36a555@nttcom.co.jp_1 At Mon, 31 May 2021 11:52:05 +0900, Tatsuro Yamada <tatsuro.yamada.tf@nttcom.co.jp> wrote in > Since the above behavior is different from the behavior of the > test command in the following example in postgresql.conf, I think > we should write a note about this example. > > # e.g. 'test ! -f /mnt/server/archivedir/%f && cp %p > # /mnt/server/archivedir/%f' > > Let me describe the problem we faced. > - When archive_mode=always, archive_command is (sometimes) executed > in a situation where the history file already exists on the standby > side. > > - In this case, if "test ! -f" is written in the archive_command of > postgresql.conf on the standby side, the command will keep failing. > > Note that this problem does not occur when archive_mode=on. > > So, what should we do for the user? I think we should put some notes > in postgresql.conf or in the documentation. For example, something > like this: I'm not sure about the exact configuration you have in mind, but that would happen on the cascaded standby in the case where the upstream promotes. In this case, the history file for the new timeline is archived twice. walreceiver triggers archiving of the new history file at the time of the promotion, then startup does the same when it restores the file from archive. Is it what you complained about? The same workaround using the alternative archive script works for the case. We could check pg_wal before fetching archive, however, archiving is not controlled so strictly that duplicate archiving never happens and I think we choose possible duplicate archiving than having holes in archive. (so we suggest the "test ! -f" script) > ==== > Note: If you use archive_mode=always, the archive_command on the > standby side should not be used "test ! -f". > ==== It could be one workaround. However, I would suggest not to overwrite existing files (with a file with different content) to protect archive from corruption. We might need to write that in the documentation... regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Moved to another thread. https://www.postgresql.org/message-id/20210531.165825.921389284096975508.horikyota.ntt@gmail.com regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Hi Horiguchi-san, On 2021/05/31 16:58, Kyotaro Horiguchi wrote: > So, I started a thread for this topic diverged from the following > thread. > > https://www.postgresql.org/message-id/4698027d-5c0d-098f-9a8e-8cf09e36a555@nttcom.co.jp_1 > >> So, what should we do for the user? I think we should put some notes >> in postgresql.conf or in the documentation. For example, something >> like this: > > I'm not sure about the exact configuration you have in mind, but that > would happen on the cascaded standby in the case where the upstream > promotes. In this case, the history file for the new timeline is > archived twice. walreceiver triggers archiving of the new history > file at the time of the promotion, then startup does the same when it > restores the file from archive. Is it what you complained about? Thank you for creating a new thread and explaining this. We are not using cascade replication in our environment, but I think the situation is similar. As an overview, when I do a promote, the archive_command fails due to the history file. I've created a reproduction script that includes building replication, and I'll share it with you. (I used Robert's test.sh as a reference for creating the reproduction script. Thanks) The scenario (sr_test_historyfile.sh) is as follows. #1 Start pgprimary as a main #2 Create standby #3 Start pgstandby as a standby #4 Execute archive command #5 Shutdown pgprimary #6 Start pgprimary as a standby #7 Promote pgprimary #8 Execute archive_command again, but failed since duplicate history file exists (see pgstandby.log) Note that this may not be appropriate if you consider it as a recovery procedure for replication configuration. However, I'm sharing it as it is because this seems to be the procedure used in the customer's environment (PG-REX). > The same workaround using the alternative archive script works for the > case. > > We could check pg_wal before fetching archive, however, archiving is > not controlled so strictly that duplicate archiving never happens and > I think we choose possible duplicate archiving than having holes in > archive. (so we suggest the "test ! -f" script) > >> ==== >> Note: If you use archive_mode=always, the archive_command on the >> standby side should not be used "test ! -f". >> ==== > > It could be one workaround. However, I would suggest not to overwrite > existing files (with a file with different content) to protect archive > from corruption. > > We might need to write that in the documentation... I think you're right, replacing it with an alternative archive script that includes the cmp command will resolve the error. The reason is that I checked with the diff command that the history files are identical. ===== $ diff -s pgprimary/arc/00000002.history pgstandby/arc/00000002.history Files pgprimary/arc/00000002.history and pgstandby/arc/00000002.history are identical ===== Regarding "test ! -f", I am wondering how many people are using the test command for archive_command. If I remember correctly, the guide provided by NTT OSS Center that we are using does not recommend using the test command. Regards, Tatsuro Yamada
Вложения
On Fri, May 28, 2021 at 2:05 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > Mmmm. That looks like meaning that we don't intend to support the > Dilip's case, and means that we support the use of > archive-command-copies-only-other-than-wal-segments? Actually, I think Dilip's case ought to be supported, but I also think that somebody else might disagree, so it's better for me if the test doesn't need to rely on it. > Agreed. I often annoyed by a long-lasting TAP script when I wanted to > do one of the test items in it. However, I was not sure which is our > policy here, consolidating all related tests into one script or having > separate scripts containing tests up to a "certain" number or a set of > tests that would take a certain time, or limiting by number the of > lines. I thought that we are on the first way as I have told several > times to put new tests into an existing script. Different people might have different opinions about this, but my opinion is that when it's possible to combine the test cases in a way that feels natural, it's good to do. For example if I have two tests that require the same setup and teardown but do different things in the middle, and if those things seem related, then it's great to set up once, try both things, and tear down once. However I don't support combining test cases where it's just concatenating them one after another, because that sort of thing seems to have no benefit. Fewer files in the source tree is not a goal of itself. > No. Thanks for the words, Robert. I might be a bit too naive, but I > had an anxious feeling that I might have been totally pointless or my > words might have been too cryptic/broken (my fingers are quite fat), > or I might have done something wrong or anything other. Anyway I > thought I might have done something wrong here. No, I don't think so. I think the difficulty is more that the three of us who are mostly involved in this conversation all have different native languages, and we are trying to discuss an issue which is very subtle. Sometimes I am having difficulty understanding precisely what either you or Dilip are intending to say, and it would not surprise me to learn that there are difficulties in the other direction also. If we seem to be covering the same topics multiple times or if any important points seem to be getting ignored, that's probably the reason. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Jun 01, 2021 at 04:45:52PM -0400, Robert Haas wrote: > On Fri, May 28, 2021 at 2:05 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > Agreed. I often annoyed by a long-lasting TAP script when I wanted to > > do one of the test items in it. However, I was not sure which is our > > policy here, consolidating all related tests into one script or having > > separate scripts containing tests up to a "certain" number or a set of > > tests that would take a certain time, or limiting by number the of > > lines. I thought that we are on the first way as I have told several > > times to put new tests into an existing script. > > Different people might have different opinions about this, but my > opinion is that when it's possible to combine the test cases in a way > that feels natural, it's good to do. For example if I have two tests > that require the same setup and teardown but do different things in > the middle, and if those things seem related, then it's great to set > up once, try both things, and tear down once. However I don't support > combining test cases where it's just concatenating them one after > another, because that sort of thing seems to have no benefit. Fewer > files in the source tree is not a goal of itself. I agree, particularly for the recovery and subscription TAP suites. When one of those tests fails on the buildfarm, it's often not obvious to me which log messages are relevant to the failure. Smaller test files simplify the investigation somewhat.
At Tue, 1 Jun 2021 16:45:52 -0400, Robert Haas <robertmhaas@gmail.com> wrote in > On Fri, May 28, 2021 at 2:05 AM Kyotaro Horiguchi > <horikyota.ntt@gmail.com> wrote: > > Mmmm. That looks like meaning that we don't intend to support the > > Dilip's case, and means that we support the use of > > archive-command-copies-only-other-than-wal-segments? > > Actually, I think Dilip's case ought to be supported, but I also think > that somebody else might disagree, so it's better for me if the test > doesn't need to rely on it. Understood. > > Agreed. I often annoyed by a long-lasting TAP script when I wanted to > > do one of the test items in it. However, I was not sure which is our > > policy here, consolidating all related tests into one script or having > > separate scripts containing tests up to a "certain" number or a set of > > tests that would take a certain time, or limiting by number the of > > lines. I thought that we are on the first way as I have told several > > times to put new tests into an existing script. > > Different people might have different opinions about this, but my > opinion is that when it's possible to combine the test cases in a way > that feels natural, it's good to do. For example if I have two tests > that require the same setup and teardown but do different things in > the middle, and if those things seem related, then it's great to set > up once, try both things, and tear down once. However I don't support > combining test cases where it's just concatenating them one after > another, because that sort of thing seems to have no benefit. Fewer > files in the source tree is not a goal of itself. Sounds like a reasonable criteria. > > No. Thanks for the words, Robert. I might be a bit too naive, but I > > had an anxious feeling that I might have been totally pointless or my > > words might have been too cryptic/broken (my fingers are quite fat), > > or I might have done something wrong or anything other. Anyway I > > thought I might have done something wrong here. > > No, I don't think so. I think the difficulty is more that the three of > us who are mostly involved in this conversation all have different > native languages, and we are trying to discuss an issue which is very > subtle. Sometimes I am having difficulty understanding precisely what > either you or Dilip are intending to say, and it would not surprise me > to learn that there are difficulties in the other direction also. If > we seem to be covering the same topics multiple times or if any > important points seem to be getting ignored, that's probably the > reason. That makes me convinced. Thanks for the thought and sorry for bothering with the complaint. Anyway, Now I agree to the all of the direction here. Thanks! -- Kyotaro Horiguchi NTT Open Source Software Center
At Tue, 01 Jun 2021 13:03:22 +0900, Tatsuro Yamada <tatsuro.yamada.tf@nttcom.co.jp> wrote in > Hi Horiguchi-san, > > On 2021/05/31 16:58, Kyotaro Horiguchi wrote: > > So, I started a thread for this topic diverged from the following > > thread. > > https://www.postgresql.org/message-id/4698027d-5c0d-098f-9a8e-8cf09e36a555@nttcom.co.jp_1 > > > >> So, what should we do for the user? I think we should put some notes > >> in postgresql.conf or in the documentation. For example, something > >> like this: > > I'm not sure about the exact configuration you have in mind, but that > > would happen on the cascaded standby in the case where the upstream > > promotes. In this case, the history file for the new timeline is > > archived twice. walreceiver triggers archiving of the new history > > file at the time of the promotion, then startup does the same when it > > restores the file from archive. Is it what you complained about? > > > Thank you for creating a new thread and explaining this. > We are not using cascade replication in our environment, but I think > the situation is similar. As an overview, when I do a promote, > the archive_command fails due to the history file. Ah, I remembered that PG-REX starts a primary as a standby then promotes it. > I've created a reproduction script that includes building replication, > and I'll share it with you. (I used Robert's test.sh as a reference > for creating the reproduction script. Thanks) > > The scenario (sr_test_historyfile.sh) is as follows. > > #1 Start pgprimary as a main > #2 Create standby > #3 Start pgstandby as a standby > #4 Execute archive command > #5 Shutdown pgprimary > #6 Start pgprimary as a standby > #7 Promote pgprimary > #8 Execute archive_command again, but failed since duplicate history > file exists (see pgstandby.log) Ok, I clearly understood what you meant. (However, it is not the legit state where a standby is running without the primary is running..) Anyway the "test ! -f" can be problematic in the case. > Note that this may not be appropriate if you consider it as a recovery > procedure for replication configuration. However, I'm sharing it as it > is > because this seems to be the procedure used in the customer's > environment (PG-REX). Understood. > Regarding "test ! -f", > I am wondering how many people are using the test command for > archive_command. If I remember correctly, the guide provided by > NTT OSS Center that we are using does not recommend using the test > command. I think, as the PG-REX documentation says, the simple cp works well as far as the assumption of PG-REX - no double failure happenes, and following the instruction - holds. On the other hand, I found that the behavior happens more generally. If a standby with archive_mode=always craches, it starts recovery from the last checkpoint. If the checkpoint were in a archived segment, the restarted standby will fetch the already-archived segment from archive then fails to archive it. (The attached). So, your fear stated upthread is applicable for wider situations. The following suggestion is rather harmful for the archive_mode=always setting. https://www.postgresql.org/docs/14/continuous-archiving.html > The archive command should generally be designed to refuse to > overwrite any pre-existing archive file. This is an important safety > feature to preserve the integrity of your archive in case of > administrator error (such as sending the output of two different > servers to the same archive directory). I'm not sure how we should treat this.. Since archive must store files actually applied to the server data, just being already archived cannot be the reason for omitting archiving. We need to make sure the new file is byte-identical to the already-archived version. We could compare just *restored* file to the same file in pg_wal but it might be too much of penalty for for the benefit. (Attached second file.) Otherwise the documentation would need someting like the following if we assume the current behavior. > The archive command should generally be designed to refuse to > overwrite any pre-existing archive file. This is an important safety > feature to preserve the integrity of your archive in case of > administrator error (such as sending the output of two different > servers to the same archive directory). + For standby with the setting archive_mode=always, there's a case where + the same file is archived more than once. For safety, it is + recommended that when the destination file exists, the archive_command + returns zero if it is byte-identical to the source file. regards. -- Kyotaro Horiguchi NTT Open Source Software Center # Copyright (c) 2021, PostgreSQL Global Development Group # # Tests related to WAL archiving and recovery. # use strict; use warnings; use PostgresNode; use TestLib; use Test::More tests => 1; use Config; my $backup_name='mybackup'; my $primary = get_new_node('primary'); $primary->init( has_archiving => 1, allows_streaming => 1); $primary->append_conf('postgresql.conf', qq[ wal_keep_size=128MB archive_mode=always log_checkpoints=yes ]); my $primary_archive = $primary->archive_dir; $primary->start; $primary->backup($backup_name); my $standby = get_new_node('standby'); my $standby_archive = $standby->archive_dir; $standby->init_from_backup($primary, $backup_name, has_streaming=>1); $standby->append_conf('postgresql.conf', qq[ restore_command='cp $primary_archive/%f %p' archive_command='test ! -f $standby_archive/%f && cp %p $standby_archive/%f' ]); $standby->start; $primary->psql('postgres', 'CHECKPOINT;SELECT pg_switch_wal();CREATE TABLE t(); pg_switch_wal();'); $standby->psql('postgres', 'CHECKPOINT'); $standby->stop('immediate'); $standby->start; $primary->psql('postgres', 'CHECKPOINT;SELECT pg_switch_wal();CHECKPOINT'); $standby->psql('postgres', 'CHECKPOINT'); my $result; while (1) { $result = $standby->safe_psql('postgres', "SELECT last_archived_wal, last_failed_wal FROM pg_stat_archiver"); sleep(0.1); last if ($result ne "|"); } ok($result =~ /^[^|]+\|$/, 'archive check 1'); diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c index 26b023e754..037da0aa3d 100644 --- a/src/backend/access/transam/xlogarchive.c +++ b/src/backend/access/transam/xlogarchive.c @@ -382,6 +382,7 @@ KeepFileRestoredFromArchive(const char *path, const char *xlogfname) { char xlogfpath[MAXPGPATH]; bool reload = false; + bool skip_archive = false; struct stat statbuf; snprintf(xlogfpath, MAXPGPATH, XLOGDIR "/%s", xlogfname); @@ -416,6 +417,56 @@ KeepFileRestoredFromArchive(const char *path, const char *xlogfname) /* same-size buffers, so this never truncates */ strlcpy(oldpath, xlogfpath, MAXPGPATH); #endif + /* + * On a standby with archive_mode=always, there's the case where the + * same file is archived more than once. If the archive_command rejects + * overwriting, WAL-archiving won't go further than the file forever. + * Avoid duplicate archiving attempts when the file is known to have + * been archived and the content doesn't change. + */ + if (XLogArchiveMode == ARCHIVE_MODE_ALWAYS && + XLogArchiveCheckDone(xlogfname)) + { + unsigned char srcbuf[XLOG_BLCKSZ]; + unsigned char dstbuf[XLOG_BLCKSZ]; + int fd1 = BasicOpenFile(path, O_RDONLY | PG_BINARY); + int fd2 = BasicOpenFile(oldpath, O_RDONLY | PG_BINARY); + uint32 i; + uint32 off = 0; + + /* + * Compare the two files' contents. We don't bother completing if + * something's wrong meanwhile. + */ + for (i = 0 ; i < wal_segment_size / XLOG_BLCKSZ ; i++) + { + if (pg_pread(fd1, srcbuf, XLOG_BLCKSZ, (off_t) off) + != XLOG_BLCKSZ) + break; + + if (pg_pread(fd2, dstbuf, XLOG_BLCKSZ, (off_t) off) + != XLOG_BLCKSZ) + break; + + if (memcmp(srcbuf, dstbuf, XLOG_BLCKSZ) != 0) + break; + + off += XLOG_BLCKSZ; + } + + close(fd1); + close(fd2); + + if (i == wal_segment_size / XLOG_BLCKSZ) + { + skip_archive = true; + + ereport(LOG, + (errmsg ("log file \"%s\" have been already archived, skip archiving", + xlogfname))); + } + } + if (unlink(oldpath) != 0) ereport(FATAL, (errcode_for_file_access(), @@ -430,7 +481,7 @@ KeepFileRestoredFromArchive(const char *path, const char *xlogfname) * Create .done file forcibly to prevent the restored segment from being * archived again later. */ - if (XLogArchiveMode != ARCHIVE_MODE_ALWAYS) + if (XLogArchiveMode != ARCHIVE_MODE_ALWAYS || skip_archive) XLogArchiveForceDone(xlogfname); else XLogArchiveNotify(xlogfname);
On Thu, May 27, 2021 at 2:26 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > Changed as suggested. I don't think the code as written here is going to work on Windows, because your code doesn't duplicate enable_restoring's call to perl2host or its backslash-escaping logic. It would really be better if we could use enable_restoring directly. Also, I discovered that the 'return' in cp_history_files should really say 'exit', because otherwise it generates a complaint every time it's run. It should also have 'use strict' and 'use warnings' at the top. Here's a version of your test case patch with the 1-line code fix added, the above issues addressed, and a bunch of cosmetic tweaks. Unfortunately, it doesn't pass for me consistently. I'm not sure if that's because I broke something with my changes, or because the test contains an underlying race condition which we need to address. Attached also are the log files from a failed run if you want to look at them. The key lines seem to be: 2021-06-03 16:16:53.984 EDT [47796] LOG: restarted WAL streaming at 0/3000000 on timeline 2 2021-06-03 16:16:54.197 EDT [47813] 025_stuck_on_old_timeline.pl LOG: statement: SELECT count(*) FROM tab_int 2021-06-03 16:16:54.197 EDT [47813] 025_stuck_on_old_timeline.pl ERROR: relation "tab_int" does not exist at character 22 Or from the main log: Waiting for replication conn cascade's replay_lsn to pass '0/3000000' on standby done error running SQL: 'psql:<stdin>:1: ERROR: relation "tab_int" does not exist LINE 1: SELECT count(*) FROM tab_int ^' I wonder whether that problem points to an issue with this incantation: $node_standby->wait_for_catchup($node_cascade, 'replay', $node_standby->lsn('replay')); But I'm not sure, and I'm out of time to investigate for today. -- Robert Haas EDB: http://www.enterprisedb.com
Вложения
At Thu, 03 Jun 2021 21:52:08 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > > https://www.postgresql.org/docs/14/continuous-archiving.html > > The archive command should generally be designed to refuse to > > overwrite any pre-existing archive file. This is an important safety > > feature to preserve the integrity of your archive in case of > > administrator error (such as sending the output of two different > > servers to the same archive directory). > > I'm not sure how we should treat this.. Since archive must store > files actually applied to the server data, just being already archived > cannot be the reason for omitting archiving. We need to make sure the > new file is byte-identical to the already-archived version. We could > compare just *restored* file to the same file in pg_wal but it might > be too much of penalty for for the benefit. (Attached second file.) (To recap: In a replication set using archive, startup tries to restore WAL files from archive before checking pg_wal directory for the desired file. The behavior itself is intentionally designed and reasonable. However, the restore code notifies of a restored file regardless of whether it has been already archived or not. If archive_command is written so as to return error for overwriting as we suggest in the documentation, that behavior causes archive failure.) After playing with this, I see the problem just by restarting a standby even in a simple archive-replication set after making not-special prerequisites. So I think this is worth fixing. With this patch, KeepFileRestoredFromArchive compares the contents of just-restored file and the existing file for the same segment only when: - archive_mode = always and - the file to restore already exists in pgwal and - it has a .done and/or .ready status file. which doesn't happen usually. Then the function skips archive notification if the contents are identical. The included TAP test is working both on Linux and Windows. regards. -- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/src/test/recovery/t/020_archive_status.pl b/src/test/recovery/t/020_archive_status.pl index 777bf05274..ed5e80327b 100644 --- a/src/test/recovery/t/020_archive_status.pl +++ b/src/test/recovery/t/020_archive_status.pl @@ -8,7 +8,7 @@ use strict; use warnings; use PostgresNode; use TestLib; -use Test::More tests => 16; +use Test::More tests => 17; use Config; my $primary = get_new_node('primary'); @@ -176,6 +176,7 @@ ok( -f "$standby1_data/$segment_path_2_done", # will cause archiving failures. my $standby2 = get_new_node('standby2'); $standby2->init_from_backup($primary, 'backup', has_restoring => 1); +$standby2->enable_archiving; $standby2->append_conf('postgresql.conf', 'archive_mode = always'); my $standby2_data = $standby2->data_dir; $standby2->start; @@ -234,3 +235,49 @@ ok( -f "$standby2_data/$segment_path_1_done" && -f "$standby2_data/$segment_path_2_done", ".done files created after archive success with archive_mode=always on standby" ); + + +# Check if restart of a archive-replicated standby doesn't archive the +# same file twice. + +# Use duplicate-resistent archive_command +my $archpath = TestLib::perl2host($standby2->archive_dir); +$archpath =~ s{\\}{\\\\}g if ($TestLib::windows_os); +my $nodup_command = + $TestLib::windows_os + ? qq{if not exist "$archpath\\\\%f" (copy "%p" "$archpath\\\\%f") else (exit 1)} + : qq{test ! -f "$archpath/%f" && cp "%p" "$archpath/%f"}; +$standby2->safe_psql( + 'postgres', qq{ + ALTER SYSTEM SET archive_command TO '$nodup_command'; + SELECT pg_reload_conf(); +}); + +# Remember the current semgent file, The +1 below is an adjustment to avoid +# segment borders. +my $cursegfile = $primary->safe_psql( + 'postgres', + 'SELECT pg_walfile_name(pg_current_wal_lsn() + 1)'); +$primary->psql('postgres', 'CHECKPOINT; SELECT pg_switch_wal();'); +$standby2->psql('postgres', 'CHECKPOINT'); +$standby2->poll_query_until( + 'postgres', + "SELECT last_archived_wal = '$cursegfile' FROM pg_stat_archiver") + or die "Timed out waiting for the segment to be archived"; + +$standby2->restart; + +$primary->psql('postgres', 'CHECKPOINT; SELECT pg_switch_wal();'); + +$standby2->poll_query_until( + 'postgres', qq[ + SELECT last_archived_wal IS NOT NULL OR + last_failed_wal IS NOT NULL + FROM pg_stat_archiver]); + +my $result = + $standby2->safe_psql( + 'postgres', + "SELECT last_archived_wal, last_failed_wal FROM pg_stat_archiver"); + +ok($result =~ /^[^|]+\|$/, 'check if archiving proceeds'); diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c index 26b023e754..7d94e7f963 100644 --- a/src/backend/access/transam/xlogarchive.c +++ b/src/backend/access/transam/xlogarchive.c @@ -382,6 +382,7 @@ KeepFileRestoredFromArchive(const char *path, const char *xlogfname) { char xlogfpath[MAXPGPATH]; bool reload = false; + bool already_archived = false; struct stat statbuf; snprintf(xlogfpath, MAXPGPATH, XLOGDIR "/%s", xlogfname); @@ -416,6 +417,68 @@ KeepFileRestoredFromArchive(const char *path, const char *xlogfname) /* same-size buffers, so this never truncates */ strlcpy(oldpath, xlogfpath, MAXPGPATH); #endif + /* + * On a standby with archive_mode=always, there's the case where the + * same file is archived more than once. If the archive_command rejects + * overwriting, WAL-archiving won't go further than the file forever. + * Avoid duplicate archiving attempts when the file with the same + * content is known to have been already archived or notified. + */ + if (XLogArchiveMode == ARCHIVE_MODE_ALWAYS && + XLogArchiveIsReadyOrDone(xlogfname)) + { + int fd1; + int fd2 = -1; + + fd1 = BasicOpenFile(path, O_RDONLY | PG_BINARY); + if (fd1 >= 0) + fd2 = BasicOpenFile(oldpath, O_RDONLY | PG_BINARY); + + if (fd1 < 0 || fd2 < 0) + { + ereport(WARNING, + (errcode_for_file_access(), + errmsg("could not open file \"%s\", skip duplicate check: %m", + fd1 < 0 ? path : oldpath))); + if (fd1 >= 0) + close(fd1); + } + else + { + unsigned char srcbuf[XLOG_BLCKSZ]; + unsigned char dstbuf[XLOG_BLCKSZ]; + uint32 i; + + /* + * Compare the two files' contents. We don't bother + * completing if something's wrong meanwhile. + */ + for (i = 0 ; i < wal_segment_size / XLOG_BLCKSZ ; i++) + { + if (read(fd1, srcbuf, XLOG_BLCKSZ) != XLOG_BLCKSZ) + break; + + if (read(fd2, dstbuf, XLOG_BLCKSZ) != XLOG_BLCKSZ) + break; + + if (memcmp(srcbuf, dstbuf, XLOG_BLCKSZ) != 0) + break; + } + + close(fd1); + close(fd2); + + if (i == wal_segment_size / XLOG_BLCKSZ) + { + already_archived = true; + + ereport(LOG, + (errmsg("log file \"%s\" have been already archived, skip archiving", + xlogfname))); + } + } + } + if (unlink(oldpath) != 0) ereport(FATAL, (errcode_for_file_access(), @@ -432,7 +495,7 @@ KeepFileRestoredFromArchive(const char *path, const char *xlogfname) */ if (XLogArchiveMode != ARCHIVE_MODE_ALWAYS) XLogArchiveForceDone(xlogfname); - else + else if (!already_archived) XLogArchiveNotify(xlogfname); /*
On Fri, Jun 4, 2021 at 2:03 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Thu, May 27, 2021 at 2:26 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > Changed as suggested. > > I don't think the code as written here is going to work on Windows, > because your code doesn't duplicate enable_restoring's call to > perl2host or its backslash-escaping logic. It would really be better > if we could use enable_restoring directly. Also, I discovered that the > 'return' in cp_history_files should really say 'exit', because > otherwise it generates a complaint every time it's run. It should also > have 'use strict' and 'use warnings' at the top. Ok > Here's a version of your test case patch with the 1-line code fix > added, the above issues addressed, and a bunch of cosmetic tweaks. > Unfortunately, it doesn't pass for me consistently. I'm not sure if > that's because I broke something with my changes, or because the test > contains an underlying race condition which we need to address. > Attached also are the log files from a failed run if you want to look > at them. The key lines seem to be: I could not reproduce this but I think I got the issue, I think I used the wrong target LSN in wait_for_catchup, instead of checking the last "insert LSN" of the standby I was waiting for last "replay LSN" of standby which was wrong. Changed as below in the attached patch. diff --git a/src/test/recovery/t/025_stuck_on_old_timeline.pl b/src/test/recovery/t/025_stuck_on_old_timeline.pl index 09eb3eb..ee7d78d 100644 --- a/src/test/recovery/t/025_stuck_on_old_timeline.pl +++ b/src/test/recovery/t/025_stuck_on_old_timeline.pl @@ -78,7 +78,7 @@ $node_standby->safe_psql('postgres', "CREATE TABLE tab_int AS SELECT 1 AS a"); # Wait for the replication to catch up $node_standby->wait_for_catchup($node_cascade, 'replay', - $node_standby->lsn('replay')); + $node_standby->lsn('insert')); # Check that cascading standby has the new content my $result = -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Вложения
At Fri, 4 Jun 2021 13:21:08 +0530, Dilip Kumar <dilipbalaut@gmail.com> wrote in > On Fri, Jun 4, 2021 at 2:03 AM Robert Haas <robertmhaas@gmail.com> wrote: > > > > On Thu, May 27, 2021 at 2:26 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > Changed as suggested. > > > > I don't think the code as written here is going to work on Windows, > > because your code doesn't duplicate enable_restoring's call to > > perl2host or its backslash-escaping logic. It would really be better > > if we could use enable_restoring directly. Also, I discovered that the > > 'return' in cp_history_files should really say 'exit', because > > otherwise it generates a complaint every time it's run. It should also > > have 'use strict' and 'use warnings' at the top. > > Ok > > > Here's a version of your test case patch with the 1-line code fix > > added, the above issues addressed, and a bunch of cosmetic tweaks. > > Unfortunately, it doesn't pass for me consistently. I'm not sure if > > that's because I broke something with my changes, or because the test > > contains an underlying race condition which we need to address. > > Attached also are the log files from a failed run if you want to look > > at them. The key lines seem to be: > > I could not reproduce this but I think I got the issue, I think I used > the wrong target LSN in wait_for_catchup, instead of checking the last > "insert LSN" of the standby I was waiting for last "replay LSN" of > standby which was wrong. Changed as below in the attached patch. I think that's right. And the test script detects the issue for me both on Linux but doesn't work for Windows. '"C:/../Documents/work/postgresql/src/test/recovery/t/cp_history_files"' is not recognized as an internal command or externalcommand .. Because Windows' cmd.exe doesn't have the shbang feature. On Windows, maybe archive_command should be like '".../perl" "$FindBin../cp_history_files" "%p"... If I did this I got another error. "couldn't copy pg_wal\00000002.history to C:/../Documents/work/postgresql/src est^Mecovery/tmp_check/t_000_a_primary_data/archives/00000002.history: at C:/../Documents/work/postgresql/src/test/recovery/t/cp_history_filesline 10.^M" ("^M" are the replacement for carrage return) So.. I'm not sure what is happening but the error messages, or.. Anyway I don't have a time to investigate it. + # clean up + $node_primary->teardown_node; + $node_standby->teardown_node; + $node_cascade->teardown_node; I don't think explicit teardown is useless as the final cleanup. By the way the attached patch is named as "Fix-corner-case..." but doesn't contain the fix. Is it intentional? regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Fri, Jun 4, 2021 at 3:51 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > I could not reproduce this but I think I got the issue, I think I used > the wrong target LSN in wait_for_catchup, instead of checking the last > "insert LSN" of the standby I was waiting for last "replay LSN" of > standby which was wrong. Changed as below in the attached patch. Yeah, that fixes it for me. Thanks. With that change, this test reliably passes for me with the fix, and reliably fails for me without the fix. Woohoo! -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Jun 4, 2021 at 5:25 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > I think that's right. And the test script detects the issue for me > both on Linux but doesn't work for Windows. > > '"C:/../Documents/work/postgresql/src/test/recovery/t/cp_history_files"' is not recognized as an internal command or externalcommand .. Hmm, that's a problem. Can you try the attached version? > + # clean up > + $node_primary->teardown_node; > + $node_standby->teardown_node; > + $node_cascade->teardown_node; > > I don't think explicit teardown is useless as the final cleanup. I don't know what you mean by this. If it's not useless, good, because we're doing it. Or do you mean that you think it is useless, and we should remove it? > By the way the attached patch is named as "Fix-corner-case..." but > doesn't contain the fix. Is it intentional? No, that was a goof. -- Robert Haas EDB: http://www.enterprisedb.com
Вложения
At Fri, 4 Jun 2021 10:56:12 -0400, Robert Haas <robertmhaas@gmail.com> wrote in > On Fri, Jun 4, 2021 at 5:25 AM Kyotaro Horiguchi > <horikyota.ntt@gmail.com> wrote: > > I think that's right. And the test script detects the issue for me > > both on Linux but doesn't work for Windows. > > > > '"C:/../Documents/work/postgresql/src/test/recovery/t/cp_history_files"' is not recognized as an internal command orexternal command .. > > Hmm, that's a problem. Can you try the attached version? Unfortunately no. The backslashes in the binary path need to be escaped. (taken from PostgresNode.pm:1008) > (my $perlbin = $^X) =~ s{\\}{\\\\}g if ($TestLib::windows_os); > $node_primary->append_conf( > 'postgresql.conf', qq( > archive_command = '$perlbin "$FindBin::RealBin/cp_history_files" "%p" "$archivedir_primary/%f"' > )); This works for me. > > + # clean up > > + $node_primary->teardown_node; > > + $node_standby->teardown_node; > > + $node_cascade->teardown_node; > > > > I don't think explicit teardown is useless as the final cleanup. > > I don't know what you mean by this. If it's not useless, good, because > we're doing it. Or do you mean that you think it is useless, and we > should remove it? Ugh! Sorry. I meant "The explicit teardowns are useless". That's not harmful but it is done by PostgresNode.pm automatically(implicitly) and we don't do that in the existing scripts. > > By the way the attached patch is named as "Fix-corner-case..." but > > doesn't contain the fix. Is it intentional? > > No, that was a goof. As I said upthread the relationship between receiveTLI and recoveryTargetTLI is not confirmed yet at the point. findNewestTimeLine() simply searches for the history file with the largest timeline id so the returned there's a case where the timeline id that the function returns is not a future of the latest checkpoint TLI. I think that the fact that rescanLatestTimeLine() checks the relationship is telling us that we need to do the same in the path as well. In my previous proposal, it is done just after the line the patch touches but it can be in the if (fetching_ckpt) branch. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Sorry, some extra words are left alone. At Mon, 07 Jun 2021 13:57:35 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > As I said upthread the relationship between receiveTLI and > recoveryTargetTLI is not confirmed yet at the point. - findNewestTimeLine() simply searches for the history file with the - largest timeline id so the returned there's a case where the timeline + findNewestTimeLine() simply searches for the history file with the + largest timeline id so there's a case where the timeline > id that the function returns is not a future of the latest checkpoint > TLI. I think that the fact that rescanLatestTimeLine() checks the > relationship is telling us that we need to do the same in the path as > well. > > In my previous proposal, it is done just after the line the patch > touches but it can be in the if (fetching_ckpt) branch. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Hi Horiguchi-san, >> Regarding "test ! -f", >> I am wondering how many people are using the test command for >> archive_command. If I remember correctly, the guide provided by >> NTT OSS Center that we are using does not recommend using the test >> command. > > I think, as the PG-REX documentation says, the simple cp works well as > far as the assumption of PG-REX - no double failure happenes, and > following the instruction - holds. I believe that this assumption started to be wrong after archive_mode=always was introduced. As far as I can tell, it doesn't happen when it's archive_mode=on. > On the other hand, I found that the behavior happens more generally. > > If a standby with archive_mode=always craches, it starts recovery from > the last checkpoint. If the checkpoint were in a archived segment, the > restarted standby will fetch the already-archived segment from archive > then fails to archive it. (The attached). > > So, your fear stated upthread is applicable for wider situations. The > following suggestion is rather harmful for the archive_mode=always > setting. > > https://www.postgresql.org/docs/14/continuous-archiving.html >> The archive command should generally be designed to refuse to >> overwrite any pre-existing archive file. This is an important safety >> feature to preserve the integrity of your archive in case of >> administrator error (such as sending the output of two different >> servers to the same archive directory). > > I'm not sure how we should treat this.. Since archive must store > files actually applied to the server data, just being already archived > cannot be the reason for omitting archiving. We need to make sure the > new file is byte-identical to the already-archived version. We could > compare just *restored* file to the same file in pg_wal but it might > be too much of penalty for for the benefit. (Attached second file.) Thanks for creating the patch! > Otherwise the documentation would need someting like the following if > we assume the current behavior. > >> The archive command should generally be designed to refuse to >> overwrite any pre-existing archive file. This is an important safety >> feature to preserve the integrity of your archive in case of >> administrator error (such as sending the output of two different >> servers to the same archive directory). > + For standby with the setting archive_mode=always, there's a case where > + the same file is archived more than once. For safety, it is > + recommended that when the destination file exists, the archive_command > + returns zero if it is byte-identical to the source file. Agreed. That is same solution as I mentioned earlier. If possible, it also would better to write it postgresql.conf (that might be overkill?!). Regards, Tatsuro Yamada
Hi Horiguchi-san, > (To recap: In a replication set using archive, startup tries to > restore WAL files from archive before checking pg_wal directory for > the desired file. The behavior itself is intentionally designed and > reasonable. However, the restore code notifies of a restored file > regardless of whether it has been already archived or not. If > archive_command is written so as to return error for overwriting as we > suggest in the documentation, that behavior causes archive failure.) > > After playing with this, I see the problem just by restarting a > standby even in a simple archive-replication set after making > not-special prerequisites. So I think this is worth fixing. > > With this patch, KeepFileRestoredFromArchive compares the contents of > just-restored file and the existing file for the same segment only > when: > > - archive_mode = always > and - the file to restore already exists in pgwal > and - it has a .done and/or .ready status file. > > which doesn't happen usually. Then the function skips archive > notification if the contents are identical. The included TAP test is > working both on Linux and Windows. Thank you for the analysis and the patch. I'll try the patch tomorrow. I just noticed that this thread is still tied to another thread (it's not an independent thread). To fix that, it may be better to create a new thread again. Regards, Tatsuro Yamada
At Mon, 07 Jun 2021 16:13:08 +0900, Tatsuro Yamada <tatsuro.yamada.tf@nttcom.co.jp> wrote in > I just noticed that this thread is still tied to another thread > (it's not an independent thread). To fix that, it may be better to > create a new thread again. Mmm. Maybe my mailer automatically inserted In-Reply-To field for the cited messsage. Do we (the two of us) bother re-launching a new thread? regards. -- Kyotaro Horiguchi NTT Open Source Software Center
At Mon, 07 Jun 2021 15:57:00 +0900, Tatsuro Yamada <tatsuro.yamada.tf@nttcom.co.jp> wrote in > Hi Horiguchi-san, > > >> Regarding "test ! -f", > >> I am wondering how many people are using the test command for > >> archive_command. If I remember correctly, the guide provided by > >> NTT OSS Center that we are using does not recommend using the test > >> command. > > I think, as the PG-REX documentation says, the simple cp works well as > > far as the assumption of PG-REX - no double failure happenes, and > > following the instruction - holds. > > > I believe that this assumption started to be wrong after > archive_mode=always was introduced. As far as I can tell, it doesn't > happen when it's archive_mode=on. ?? Doesn't *simple* cp (without "test") work for you? I meant that the operating assumption of PG-REX ensures that overwriting doesn't cause a problem. > > Otherwise the documentation would need someting like the following if > > we assume the current behavior. > > > >> The archive command should generally be designed to refuse to > >> overwrite any pre-existing archive file. This is an important safety > >> feature to preserve the integrity of your archive in case of > >> administrator error (such as sending the output of two different > >> servers to the same archive directory). > > + For standby with the setting archive_mode=always, there's a case > > where > > + the same file is archived more than once. For safety, it is > > + recommended that when the destination file exists, the > > archive_command > > + returns zero if it is byte-identical to the source file. > > > Agreed. > That is same solution as I mentioned earlier. > If possible, it also would better to write it postgresql.conf (that > might > be overkill?!). Mmmm, I didn't noticed that. I don't think such a complex caveat fits the configuration file. And if we need such a caveart there, it might be the sign that we need to fix the causal behavior... regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On 2021/06/07 16:31, Kyotaro Horiguchi wrote: > At Mon, 07 Jun 2021 16:13:08 +0900, Tatsuro Yamada <tatsuro.yamada.tf@nttcom.co.jp> wrote in >> I just noticed that this thread is still tied to another thread >> (it's not an independent thread). To fix that, it may be better to >> create a new thread again. > > Mmm. Maybe my mailer automatically inserted In-Reply-To field for the > cited messsage. Do we (the two of us) bother re-launching a new > thread? The reason I suggested it was because I thought it might be confusing if the threads were not independent when registered in a commitfest. If that is not a problem, then I'm fine with it as is. :-D Regards, Tatsuro Yamada
(Sorry for the noise on the old thread..) At Mon, 07 Jun 2021 16:54:49 +0900, Tatsuro Yamada <tatsuro.yamada.tf@nttcom.co.jp> wrote in > On 2021/06/07 16:31, Kyotaro Horiguchi wrote: > > At Mon, 07 Jun 2021 16:13:08 +0900, Tatsuro Yamada > > <tatsuro.yamada.tf@nttcom.co.jp> wrote in > >> I just noticed that this thread is still tied to another thread > >> (it's not an independent thread). To fix that, it may be better to > >> create a new thread again. > > Mmm. Maybe my mailer automatically inserted In-Reply-To field for the > > cited messsage. Do we (the two of us) bother re-launching a new > > thread? > > > The reason I suggested it was because I thought it might be > confusing if the threads were not independent when registered in > a commitfest. If that is not a problem, then I'm fine with it as > is. :-D (You can freely do that, too:p) Hmm. I found that the pgsql-hackers archive treats the new thread as a part of the old thread, so CF-app would do the same. Anyway I re-launched a new standalone thread. https://www.postgresql.org/message-id/20210607.173108.348241508233844279.horikyota.ntt%40gmail.com regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Mon, Jun 7, 2021 at 12:57 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > Unfortunately no. The backslashes in the binary path need to be > escaped. (taken from PostgresNode.pm:1008) > > > (my $perlbin = $^X) =~ s{\\}{\\\\}g if ($TestLib::windows_os); > > $node_primary->append_conf( > > 'postgresql.conf', qq( > > archive_command = '$perlbin "$FindBin::RealBin/cp_history_files" "%p" "$archivedir_primary/%f"' > > )); > > This works for me. Hmm, OK. Do you think we also need to use perl2host in this case? > Ugh! Sorry. I meant "The explicit teardowns are useless". That's not > harmful but it is done by PostgresNode.pm automatically(implicitly) > and we don't do that in the existing scripts. OK. I don't think it's a big deal, but we can remove them. > As I said upthread the relationship between receiveTLI and > recoveryTargetTLI is not confirmed yet at the point. > findNewestTimeLine() simply searches for the history file with the > largest timeline id so the returned there's a case where the timeline > id that the function returns is not a future of the latest checkpoint > TLI. I think that the fact that rescanLatestTimeLine() checks the > relationship is telling us that we need to do the same in the path as > well. > > In my previous proposal, it is done just after the line the patch > touches but it can be in the if (fetching_ckpt) branch. I went back and looked at your patch again, now that I understand the issue better. I believe it's not necessary to do this here, because StartupXLOG() already contains a check for the same thing: /* * If the location of the checkpoint record is not on the expected * timeline in the history of the requested timeline, we cannot proceed: * the backup is not part of the history of the requested timeline. */ Assert(expectedTLEs); /* was initialized by reading checkpoint * record */ if (tliOfPointInHistory(checkPointLoc, expectedTLEs) != checkPoint.ThisTimeLineID) ... This code is always run after ReadCheckpointRecord() returns. And I think that your only concern here is about the case where the checkpoint record is being fetched, because otherwise expectedTLEs must already be set. By the way, I also noticed that your version of the patch contains a few words which are spelled incorrectly: hearafter, and incosistent. -- Robert Haas EDB: http://www.enterprisedb.com
Greetings, * Kyotaro Horiguchi (horikyota.ntt@gmail.com) wrote: > So, this is the new new thread. This is definitely not the way I would recommend starting up a new thread as you didn't include the actual text of the prior discussion for people to be able to read and respond to, instead making them go hunt for the prior discussion on the old thread and negating the point of starting a new thread.. Still, I went and found the other email- * Kyotaro Horiguchi (horikyota.ntt@gmail.com) wrote: > At Mon, 31 May 2021 11:52:05 +0900, Tatsuro Yamada <tatsuro.yamada.tf@nttcom.co.jp> wrote in > > Since the above behavior is different from the behavior of the > > test command in the following example in postgresql.conf, I think > > we should write a note about this example. > > > > # e.g. 'test ! -f /mnt/server/archivedir/%f && cp %p > > # /mnt/server/archivedir/%f' > > > > Let me describe the problem we faced. > > - When archive_mode=always, archive_command is (sometimes) executed > > in a situation where the history file already exists on the standby > > side. > > > > - In this case, if "test ! -f" is written in the archive_command of > > postgresql.conf on the standby side, the command will keep failing. > > > > Note that this problem does not occur when archive_mode=on. > > > > So, what should we do for the user? I think we should put some notes > > in postgresql.conf or in the documentation. For example, something > > like this: First off, we should tell them to not use test or cp in their actual archive command because they don't do things like make sure that the WAL that's been archived has actually been fsync'd. Multiple people have tried to make improvements in this area but the long and short of it is that trying to provide a simple archive command in the documentation that actually *works* isn't enough- you need a real tool. Maybe someone will write one some day that's part of core, but it's not happened yet and instead there's external solutions which actually do the correct things. The existing documentation should be taken as purely "this is how the variables which are passed in get expanded" not as "this is what you should do", because it's very much not the latter in any form. Thanks, Stephen
Вложения
Hi, I tried back-porting my version of this patch to 9.6 to see what would happen there. One problem is that some of the functions have different names before v10. So 9.6 needs this: - "SELECT pg_walfile_name(pg_current_wal_lsn());"); + "SELECT pg_xlogfile_name(pg_current_xlog_location());"); But there's also another problem, which is that this doesn't work before v12: $node_standby->psql('postgres', 'SELECT pg_promote()'); So I tried changing it to this: $node_standby->promote; But then the test fails, because pg_promote() has logic built into it to wait until the promotion actually happens, but ->promote doesn't, so SELECT pg_walfile_name(pg_current_wal_lsn()) errors out because the system is still in recovery. I'm not sure what to do about that. I quickly tried adding -w to 'sub promote' in PostgresNode.pm, but that didn't fix it, so I guess we'll have to find some other way to wait until the promotion is complete. -- Robert Haas EDB: http://www.enterprisedb.com
At Mon, 7 Jun 2021 10:40:27 -0400, Robert Haas <robertmhaas@gmail.com> wrote in > On Mon, Jun 7, 2021 at 12:57 AM Kyotaro Horiguchi > <horikyota.ntt@gmail.com> wrote: > > Unfortunately no. The backslashes in the binary path need to be > > escaped. (taken from PostgresNode.pm:1008) > > > > > (my $perlbin = $^X) =~ s{\\}{\\\\}g if ($TestLib::windows_os); > > > $node_primary->append_conf( > > > 'postgresql.conf', qq( > > > archive_command = '$perlbin "$FindBin::RealBin/cp_history_files" "%p" "$archivedir_primary/%f"' > > > )); > > > > This works for me. > > Hmm, OK. Do you think we also need to use perl2host in this case? I understand that perl2host converts '/some/where' style path to the native windows path 'X:/any/where' if needed. Since perl's $^X is already in native style so I think we don't need to use it. > > Ugh! Sorry. I meant "The explicit teardowns are useless". That's not > > harmful but it is done by PostgresNode.pm automatically(implicitly) > > and we don't do that in the existing scripts. > > OK. I don't think it's a big deal, but we can remove them. Thanks. > I went back and looked at your patch again, now that I understand the > issue better. I believe it's not necessary to do this here, because > StartupXLOG() already contains a check for the same thing: > > /* > * If the location of the checkpoint record is not on the expected > * timeline in the history of the requested timeline, we cannot proceed: > * the backup is not part of the history of the requested timeline. > */ > Assert(expectedTLEs); /* was initialized by reading checkpoint > * record */ > if (tliOfPointInHistory(checkPointLoc, expectedTLEs) != > checkPoint.ThisTimeLineID) > ... > > This code is always run after ReadCheckpointRecord() returns. And I > think that your only concern here is about the case where the > checkpoint record is being fetched, because otherwise expectedTLEs > must already be set. Sure. Thanks for confirming that, and agreed. > By the way, I also noticed that your version of the patch contains a > few words which are spelled incorrectly: hearafter, and incosistent. Mmm. Sorry for them.. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On 2021/06/07 17:32, Kyotaro Horiguchi wrote: >>>> I just noticed that this thread is still tied to another thread >>>> (it's not an independent thread). To fix that, it may be better to >>>> create a new thread again. >>> Mmm. Maybe my mailer automatically inserted In-Reply-To field for the >>> cited messsage. Do we (the two of us) bother re-launching a new >>> thread? >> >> >> The reason I suggested it was because I thought it might be >> confusing if the threads were not independent when registered in >> a commitfest. If that is not a problem, then I'm fine with it as >> is. :-D > > (You can freely do that, too:p) I should have told you that I would be happy to create a new thread. Why I didn't create new thread is that because I didn't want people to think I had hijacked the thread. :) > Hmm. I found that the pgsql-hackers archive treats the new thread as a > part of the old thread, so CF-app would do the same. > > Anyway I re-launched a new standalone thread. > > https://www.postgresql.org/message-id/20210607.173108.348241508233844279.horikyota.ntt%40gmail.com Thank you! Regards, Tatsuro Yamada
(Mmm. thunderbird or gmail connects this thread to the previous one..) At Mon, 7 Jun 2021 14:20:38 -0400, Stephen Frost <sfrost@snowman.net> wrote in > Greetings, > > * Kyotaro Horiguchi (horikyota.ntt@gmail.com) wrote: > > So, this is the new new thread. > > This is definitely not the way I would recommend starting up a new > thread as you didn't include the actual text of the prior discussion for > people to be able to read and respond to, instead making them go hunt > for the prior discussion on the old thread and negating the point of > starting a new thread.. Sorry for that. I'll do that next time. > Still, I went and found the other email- Thanks! > * Kyotaro Horiguchi (horikyota.ntt@gmail.com) wrote: > > At Mon, 31 May 2021 11:52:05 +0900, Tatsuro Yamada <tatsuro.yamada.tf@nttcom.co.jp> wrote in > > > Since the above behavior is different from the behavior of the > > > test command in the following example in postgresql.conf, I think > > > we should write a note about this example. > > > > > > # e.g. 'test ! -f /mnt/server/archivedir/%f && cp %p > > > # /mnt/server/archivedir/%f' > > > > > > Let me describe the problem we faced. > > > - When archive_mode=always, archive_command is (sometimes) executed > > > in a situation where the history file already exists on the standby > > > side. > > > > > > - In this case, if "test ! -f" is written in the archive_command of > > > postgresql.conf on the standby side, the command will keep failing. > > > > > > Note that this problem does not occur when archive_mode=on. > > > > > > So, what should we do for the user? I think we should put some notes > > > in postgresql.conf or in the documentation. For example, something > > > like this: > > First off, we should tell them to not use test or cp in their actual > archive command because they don't do things like make sure that the WAL > that's been archived has actually been fsync'd. Multiple people have > tried to make improvements in this area but the long and short of it is > that trying to provide a simple archive command in the documentation > that actually *works* isn't enough- you need a real tool. Maybe someone > will write one some day that's part of core, but it's not happened yet > and instead there's external solutions which actually do the correct > things. Ideally I agree that it is definitely right. But the documentation doesn't say a bit of "don't use the simple copy command in any case (or at least the cases where more than a certain level of durability and integrity guarantee is required).". Actually many people are satisfied with just "cp/copy" and I think they know that the command doesn't guarantee on the integrity of archived files on , say, some disastrous situation like a sudden power cut. However, the use of "test ! -f..." is in a bit different kind of suggestion. https://www.postgresql.org/docs/13/continuous-archiving.html | The archive command should generally be designed to refuse to | overwrite any pre-existing archive file. This is an important safety | feature to preserve the integrity of your archive in case of | administrator error (such as sending the output of two different | servers to the same archive directory) This implies that no WAL segment are archived more than once at least under any valid operation. Some people are following this suggestion to prevent archive from breaking by some *wrong* operations. > The existing documentation should be taken as purely "this is how the > variables which are passed in get expanded" not as "this is what you > should do", because it's very much not the latter in any form. It describes "how archive_command should be like" and showing examples among the description implies that the example conforms the should-be's. Nevertheless, the issue here is that there's a case where archiving stalls when following the suggestion above under a certain condition. Even if it is written premising "set .. archive_mode to on", I don't believe that people can surmise that the same archive_command might fail when setting archive_mode to always, because the description implies So I think we need to revise the documentation, or need to *fix* the revealed problem that is breaking the assumption of the documentation. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Yeah, it's hot these days... At Tue, 08 Jun 2021 12:04:43 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > (Mmm. thunderbird or gmail connects this thread to the previous one..) > > At Mon, 7 Jun 2021 14:20:38 -0400, Stephen Frost <sfrost@snowman.net> wrote in > > Greetings, > > > > * Kyotaro Horiguchi (horikyota.ntt@gmail.com) wrote: > > > So, this is the new new thread. > > > > This is definitely not the way I would recommend starting up a new > > thread as you didn't include the actual text of the prior discussion for > > people to be able to read and respond to, instead making them go hunt > > for the prior discussion on the old thread and negating the point of > > starting a new thread.. > > Sorry for that. I'll do that next time. > > > Still, I went and found the other email- > > Thanks! > > > * Kyotaro Horiguchi (horikyota.ntt@gmail.com) wrote: > > > At Mon, 31 May 2021 11:52:05 +0900, Tatsuro Yamada <tatsuro.yamada.tf@nttcom.co.jp> wrote in > > > > Since the above behavior is different from the behavior of the > > > > test command in the following example in postgresql.conf, I think > > > > we should write a note about this example. > > > > > > > > # e.g. 'test ! -f /mnt/server/archivedir/%f && cp %p > > > > # /mnt/server/archivedir/%f' > > > > > > > > Let me describe the problem we faced. > > > > - When archive_mode=always, archive_command is (sometimes) executed > > > > in a situation where the history file already exists on the standby > > > > side. > > > > > > > > - In this case, if "test ! -f" is written in the archive_command of > > > > postgresql.conf on the standby side, the command will keep failing. > > > > > > > > Note that this problem does not occur when archive_mode=on. > > > > > > > > So, what should we do for the user? I think we should put some notes > > > > in postgresql.conf or in the documentation. For example, something > > > > like this: > > > > First off, we should tell them to not use test or cp in their actual > > archive command because they don't do things like make sure that the WAL > > that's been archived has actually been fsync'd. Multiple people have > > tried to make improvements in this area but the long and short of it is > > that trying to provide a simple archive command in the documentation > > that actually *works* isn't enough- you need a real tool. Maybe someone > > will write one some day that's part of core, but it's not happened yet > > and instead there's external solutions which actually do the correct > > things. > > Ideally I agree that it is definitely right. But the documentation > doesn't say a bit of "don't use the simple copy command in any case > (or at least the cases where more than a certain level of durability > and integrity guarantee is required).". > > Actually many people are satisfied with just "cp/copy" and I think > they know that the command doesn't guarantee on the integrity of > archived files on , say, some disastrous situation like a sudden power > cut. > > However, the use of "test ! -f..." is in a bit different kind of > suggestion. > > https://www.postgresql.org/docs/13/continuous-archiving.html > | The archive command should generally be designed to refuse to > | overwrite any pre-existing archive file. This is an important safety > | feature to preserve the integrity of your archive in case of > | administrator error (such as sending the output of two different > | servers to the same archive directory) > > This implies that no WAL segment are archived more than once at least > under any valid operation. Some people are following this suggestion > to prevent archive from breaking by some *wrong* operations. > > > The existing documentation should be taken as purely "this is how the > > variables which are passed in get expanded" not as "this is what you > > should do", because it's very much not the latter in any form. > - It describes "how archive_command should be like" and showing examples + It describes "what archive_command should be like" and showing examples > among the description implies that the example conforms the > should-be's. > > Nevertheless, the issue here is that there's a case where archiving > stalls when following the suggestion above under a certain condition. > Even if it is written premising "set .. archive_mode to on", I don't > believe that people can surmise that the same archive_command might - fail when setting archive_mode to always, because the description - implies + fail when setting archive_mode to always. > > So I think we need to revise the documentation, or need to *fix* the > revealed problem that is breaking the assumption of the documentation. -- Kyotaro Horiguchi NTT Open Source Software Center
On Tue, Jun 8, 2021 at 12:32 AM Robert Haas <robertmhaas@gmail.com> wrote: > > I tried back-porting my version of this patch to 9.6 to see what would > happen there. One problem is that some of the functions have different > names before v10. So 9.6 needs this: > > - "SELECT pg_walfile_name(pg_current_wal_lsn());"); > + "SELECT pg_xlogfile_name(pg_current_xlog_location());"); > > But there's also another problem, which is that this doesn't work before v12: > > $node_standby->psql('postgres', 'SELECT pg_promote()'); > > So I tried changing it to this: > > $node_standby->promote; > > But then the test fails, because pg_promote() has logic built into it > to wait until the promotion actually happens, but ->promote doesn't, > so SELECT pg_walfile_name(pg_current_wal_lsn()) errors out because the > system is still in recovery. I'm not sure what to do about that. I > quickly tried adding -w to 'sub promote' in PostgresNode.pm, but that > didn't fix it, so I guess we'll have to find some other way to wait > until the promotion is complete. > Maybe we can use it ? # Wait until the node exits recovery. $standby->poll_query_until('postgres', "SELECT pg_is_in_recovery() = 'f';") or die "Timed out while waiting for promotion"; I will try to generate a version for 9.6 based on this idea and see how it goes -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Tue, Jun 8, 2021 at 11:13 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > # Wait until the node exits recovery. > $standby->poll_query_until('postgres', "SELECT pg_is_in_recovery() = 'f';") > or die "Timed out while waiting for promotion"; > > I will try to generate a version for 9.6 based on this idea and see how it goes I have changed for as per 9.6 but I am seeing some crash (both with/without fix), I could not figure out the reason, it did not generate any core dump, although I changed pg_ctl in PostgresNode.pm to use "-c" so that it can generate core but it did not generate any core file. This is log from cascading node (025_stuck_on_old_timeline_cascade.log) ------------- cp: cannot stat ‘/home/dilipkumar/work/PG/postgresql/src/test/recovery/tmp_check/data_primary_52dW/archives/000000010000000000000003’: No such file or directory WARNING: terminating connection because of crash of another server process DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. HINT: In a moment you should be able to reconnect to the database and repeat your command. FATAL: could not receive database system identifier and timeline ID from the primary server: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. -------------- The attached logs are when I ran without a fix. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Вложения
On Tue, Jun 8, 2021 at 4:47 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > I have changed for as per 9.6 but I am seeing some crash (both > with/without fix), I could not figure out the reason, it did not > generate any core dump, although I changed pg_ctl in PostgresNode.pm > to use "-c" so that it can generate core but it did not generate any > core file. I think the problem is here: Can't locate object method "lsn" via package "PostgresNode" at t/025_stuck_on_old_timeline.pl line 84. When that happens, it bails out, and cleans everything up, doing an immediate shutdown of all the nodes. The 'lsn' method was added by commit fb093e4cb36fe40a1c3f87618fb8362845dae0f0, so it only appears in v10 and later. I think maybe we can think of back-porting that to 9.6. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Jun 8, 2021 at 12:26 PM Robert Haas <robertmhaas@gmail.com> wrote: > I think the problem is here: > > Can't locate object method "lsn" via package "PostgresNode" at > t/025_stuck_on_old_timeline.pl line 84. > > When that happens, it bails out, and cleans everything up, doing an > immediate shutdown of all the nodes. The 'lsn' method was added by > commit fb093e4cb36fe40a1c3f87618fb8362845dae0f0, so it only appears in > v10 and later. I think maybe we can think of back-porting that to 9.6. Here's an updated set of patches. I removed the extra teardown_node calls per Kyotaro Horiguchi's request. I adopted his suggestion for setting a $perlbin variable from $^X, but found that $perlbin was undefined, so I split the incantation into two lines to fix that. I updated the code to use ->promote() instead of calling pg_promote(), and to use poll_query_until() afterwards to wait for promotion as suggested by Dilip. Also, I added a comment to the change in xlog.c. Then I tried to get things working on 9.6. There's a patch attached to back-port a couple of PostgresNode.pm methods from 10 to 9.6, and also a version of the main patch attached with the necessary wal->xlog, lsn->location renaming. Unfortunately ... the new test case still fails on 9.6 in a way that looks an awful lot like the bug isn't actually fixed: LOG: primary server contains no more WAL on requested timeline 1 cp: /Users/rhaas/pgsql/src/test/recovery/tmp_check/data_primary_enMi/archives/000000010000000000000003: No such file or directory (repeated many times) I find that the same failure happens if I back-port the master version of the patch to v10 or v11, but if I back-port it to v12 or v13 then the test passes as expected. I haven't figured out what the issue is yet. I also noticed that if I back-port it to v12 and then revert the code change, the test still passes. So I think there may be something subtly wrong with this test case yet. Or maybe a code bug. -- Robert Haas EDB: http://www.enterprisedb.com
Вложения
On Wed, Jun 9, 2021 at 2:07 AM Robert Haas <robertmhaas@gmail.com> wrote:
Then I tried to get things working on 9.6. There's a patch attached to
back-port a couple of PostgresNode.pm methods from 10 to 9.6, and also
a version of the main patch attached with the necessary wal->xlog,
lsn->location renaming. Unfortunately ... the new test case still
fails on 9.6 in a way that looks an awful lot like the bug isn't
actually fixed:
LOG: primary server contains no more WAL on requested timeline 1
cp: /Users/rhaas/pgsql/src/test/recovery/tmp_check/data_primary_enMi/archives/000000010000000000000003:
No such file or directory
(repeated many times)
I find that the same failure happens if I back-port the master version
of the patch to v10 or v11,
I think this fails because prior to v12 the recovery target tli was not set to the latest by default because it was not GUC at that time. So after below fix it started passing on v11(only tested on v11 so far).
diff --git a/src/test/recovery/t/025_stuck_on_old_timeline.pl b/src/test/recovery/t/025_stuck_on_old_timeline.pl
index 842878a..b3ce5da 100644
--- a/src/test/recovery/t/025_stuck_on_old_timeline.pl
+++ b/src/test/recovery/t/025_stuck_on_old_timeline.pl
@@ -50,6 +50,9 @@ my $node_cascade = get_new_node('cascade');
$node_cascade->init_from_backup($node_standby, $backup_name,
has_streaming => 1);
$node_cascade->enable_restoring($node_primary);
+$node_cascade->append_conf('recovery.conf', qq(
+recovery_target_timeline='latest'
+));
index 842878a..b3ce5da 100644
--- a/src/test/recovery/t/025_stuck_on_old_timeline.pl
+++ b/src/test/recovery/t/025_stuck_on_old_timeline.pl
@@ -50,6 +50,9 @@ my $node_cascade = get_new_node('cascade');
$node_cascade->init_from_backup($node_standby, $backup_name,
has_streaming => 1);
$node_cascade->enable_restoring($node_primary);
+$node_cascade->append_conf('recovery.conf', qq(
+recovery_target_timeline='latest'
+));
But now it started passing even without the fix and the log says that it never tried to stream from primary using TL 1 so it never hit the defect location.
2021-06-09 12:11:08.618 IST [122456] LOG: entering standby mode
2021-06-09 12:11:08.622 IST [122456] LOG: restored log file "00000002.history" from archive
cp: cannot stat ‘/home/dilipkumar/work/PG/postgresql/src/test/recovery/tmp_check/t_025_stuck_on_old_timeline_primary_data/archives/000000010000000000000002’: No such file or directory
2021-06-09 12:11:08.627 IST [122456] LOG: redo starts at 0/2000028
2021-06-09 12:11:08.627 IST [122456] LOG: consistent recovery state reached at 0/3000000
2021-06-09 12:11:08.622 IST [122456] LOG: restored log file "00000002.history" from archive
cp: cannot stat ‘/home/dilipkumar/work/PG/postgresql/src/test/recovery/tmp_check/t_025_stuck_on_old_timeline_primary_data/archives/000000010000000000000002’: No such file or directory
2021-06-09 12:11:08.627 IST [122456] LOG: redo starts at 0/2000028
2021-06-09 12:11:08.627 IST [122456] LOG: consistent recovery state reached at 0/3000000
Next, I will investigate, without a fix on v11 (maybe v12, v10..) why it is not hitting the defect location at all. And after that, I will check the status on other older versions.
On Wed, Jun 9, 2021 at 12:14 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Wed, Jun 9, 2021 at 2:07 AM Robert Haas <robertmhaas@gmail.com> wrote: > 2021-06-09 12:11:08.618 IST [122456] LOG: entering standby mode > 2021-06-09 12:11:08.622 IST [122456] LOG: restored log file "00000002.history" from archive > cp: cannot stat ‘/home/dilipkumar/work/PG/postgresql/src/test/recovery/tmp_check/t_025_stuck_on_old_timeline_primary_data/archives/000000010000000000000002’: Nosuch file or directory > 2021-06-09 12:11:08.627 IST [122456] LOG: redo starts at 0/2000028 > 2021-06-09 12:11:08.627 IST [122456] LOG: consistent recovery state reached at 0/3000000 > > Next, I will investigate, without a fix on v11 (maybe v12, v10..) why it is not hitting the defect location at all. Andafter that, I will check the status on other older versions. Reason for the problem was that the "-Xnone" parameter was not accepted by "sub backup" in PostgresNode.pm so I created that for backpatch. With attached patches I am to make it pass in v12,v11,v10 (with fix) and fail (without fix). However, we will have to make some change for 9.6 because pg_basebackup doesn't support -Xnone on 9.6, maybe we can delete the content from pg_wal after the backup, if we think that approach looks fine then I will make the changes for 9.6 as well. Note: for param backport for v12 and v11 same patch getting applied but for v10 due to some conflict we need a separate patch (both attached). -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Вложения
On Wed, Jun 9, 2021 at 1:37 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Wed, Jun 9, 2021 at 12:14 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Wed, Jun 9, 2021 at 2:07 AM Robert Haas <robertmhaas@gmail.com> wrote: > > 2021-06-09 12:11:08.618 IST [122456] LOG: entering standby mode > > 2021-06-09 12:11:08.622 IST [122456] LOG: restored log file "00000002.history" from archive > > cp: cannot stat ‘/home/dilipkumar/work/PG/postgresql/src/test/recovery/tmp_check/t_025_stuck_on_old_timeline_primary_data/archives/000000010000000000000002’: Nosuch file or directory > > 2021-06-09 12:11:08.627 IST [122456] LOG: redo starts at 0/2000028 > > 2021-06-09 12:11:08.627 IST [122456] LOG: consistent recovery state reached at 0/3000000 > > > > Next, I will investigate, without a fix on v11 (maybe v12, v10..) why it is not hitting the defect location at all. And after that, I will check the status on other older versions. > > Reason for the problem was that the "-Xnone" parameter was not > accepted by "sub backup" in PostgresNode.pm so I created that for > backpatch. With attached patches I am to make it pass in v12,v11,v10 > (with fix) and fail (without fix). However, we will have to make some > change for 9.6 because pg_basebackup doesn't support -Xnone on 9.6, > maybe we can delete the content from pg_wal after the backup, if we > think that approach looks fine then I will make the changes for 9.6 as > well. > > Note: for param backport for v12 and v11 same patch getting applied > but for v10 due to some conflict we need a separate patch (both > attached). I have fixed it for 9.6 as well by removing the wal from the xlog directory. Attaching all the patches in single mail to avoid confusion. Note: v7-0001 applies to master, v13,v12 (but for v12 before this we need to apply backport) v12-v8-0001-Backport is same as v11-v8-0001-Backport (duplicated for version wise separation) v11-v8-0002 is same as v10-v8-0002 Basically, for v12 and v11 same backport patch works and for V11 and V10 same main patch works, still I duplicated them to avoid confusion. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Вложения
- 9.6-v8-0001-Back-port-a-few-PostgresNode.pm-methods.patch
- 9.6-v8-0002-Fix-corner-case-failure-of-new-standby-to-follow-.patch
- 11-v8-0001-Back-port-backup-param-in-PostgresNode.pm.patch
- 10-v8-0002-Fix-corner-case-failure-of-new-standby-to-follow-.patch
- 10-v8-0001-Back-port-backup-param-in-PostgresNode.pm.patch
- 11-v8-0002-Fix-corner-case-failure-of-new-standby-to-follow-.patch
- 12-v8-0001-Back-port-backup-param-in-PostgresNode.pm.patch
- v7-0001-Fix-corner-case-failure-of-new-standby-to-follow-.patch
On Wed, Jun 9, 2021 at 4:07 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > Reason for the problem was that the "-Xnone" parameter was not > accepted by "sub backup" in PostgresNode.pm so I created that for > backpatch. With attached patches I am to make it pass in v12,v11,v10 > (with fix) and fail (without fix). However, we will have to make some > change for 9.6 because pg_basebackup doesn't support -Xnone on 9.6, > maybe we can delete the content from pg_wal after the backup, if we > think that approach looks fine then I will make the changes for 9.6 as > well. Ah. I looked into this and found that this is because commit 081876d75ea15c3bd2ee5ba64a794fd8ea46d794 is new in master, so actually that change is absent in all the back-branches. I have now back-ported that portion of that commit to v13, v12, v11, and v10. -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, Jun 9, 2021 at 4:07 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > Reason for the problem was that the "-Xnone" parameter was not > accepted by "sub backup" in PostgresNode.pm so I created that for > backpatch. With attached patches I am to make it pass in v12,v11,v10 > (with fix) and fail (without fix). However, we will have to make some > change for 9.6 because pg_basebackup doesn't support -Xnone on 9.6, > maybe we can delete the content from pg_wal after the backup, if we > think that approach looks fine then I will make the changes for 9.6 as > well. Got it. I have now committed the patch to all branches, after adapting your changes just a little bit. Thanks to you and Kyotaro-san for all the time spent on this. What a slog! -- Robert Haas EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes: > Got it. I have now committed the patch to all branches, after adapting > your changes just a little bit. > Thanks to you and Kyotaro-san for all the time spent on this. What a slog! conchuela failed its first encounter with this test case: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2021-06-09%2021%3A12%3A25 That machine has a certain, er, history of flakiness; so this may not mean anything. Still, we'd better keep an eye out to see if the test needs more stabilization. regards, tom lane
At Wed, 09 Jun 2021 19:09:54 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in > Robert Haas <robertmhaas@gmail.com> writes: > > Got it. I have now committed the patch to all branches, after adapting > > your changes just a little bit. > > Thanks to you and Kyotaro-san for all the time spent on this. What a slog! > > conchuela failed its first encounter with this test case: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2021-06-09%2021%3A12%3A25 > > That machine has a certain, er, history of flakiness; so this may > not mean anything. Still, we'd better keep an eye out to see if > the test needs more stabilization. https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=conchuela&dt=2021-06-09%2021%3A12%3A25&stg=recovery-check > ==~_~===-=-===~_~== pgsql.build/src/test/recovery/tmp_check/log/025_stuck_on_old_timeline_cascade.log ==~_~===-=-===~_~== .... > 2021-06-09 23:31:10.439 CEST [893820:1] LOG: started streaming WAL from primary at 0/2000000 on timeline 1 > 2021-06-09 23:31:10.439 CEST [893820:2] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000000000000002has already been removed The script 025_stuck_on_olde_timeline.pl (and I) forgets to set wal_keep_size(segments). regards. -- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/src/test/recovery/t/025_stuck_on_old_timeline.pl b/src/test/recovery/t/025_stuck_on_old_timeline.pl index 0d96bb3c15..25c2dff437 100644 --- a/src/test/recovery/t/025_stuck_on_old_timeline.pl +++ b/src/test/recovery/t/025_stuck_on_old_timeline.pl @@ -27,6 +27,7 @@ $perlbin =~ s{\\}{\\\\}g if ($TestLib::windows_os); my $archivedir_primary = $node_primary->archive_dir; $node_primary->append_conf('postgresql.conf', qq( archive_command = '$perlbin "$FindBin::RealBin/cp_history_files" "%p" "$archivedir_primary/%f"' +wal_keep_size=128MB )); $node_primary->start; diff --git a/src/test/recovery/t/025_stuck_on_old_timeline.pl b/src/test/recovery/t/025_stuck_on_old_timeline.pl index 0d96bb3c15..8099571299 100644 --- a/src/test/recovery/t/025_stuck_on_old_timeline.pl +++ b/src/test/recovery/t/025_stuck_on_old_timeline.pl @@ -27,6 +27,7 @@ $perlbin =~ s{\\}{\\\\}g if ($TestLib::windows_os); my $archivedir_primary = $node_primary->archive_dir; $node_primary->append_conf('postgresql.conf', qq( archive_command = '$perlbin "$FindBin::RealBin/cp_history_files" "%p" "$archivedir_primary/%f"' +wal_keep_segments=8 )); $node_primary->start;
On Wed, Jun 9, 2021 at 9:12 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=conchuela&dt=2021-06-09%2021%3A12%3A25&stg=recovery-check > > > ==~_~===-=-===~_~== pgsql.build/src/test/recovery/tmp_check/log/025_stuck_on_old_timeline_cascade.log ==~_~===-=-===~_~== > .... > > 2021-06-09 23:31:10.439 CEST [893820:1] LOG: started streaming WAL from primary at 0/2000000 on timeline 1 > > 2021-06-09 23:31:10.439 CEST [893820:2] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment000000010000000000000002 has already been removed > > The script 025_stuck_on_olde_timeline.pl (and I) forgets to set > wal_keep_size(segments). Thanks for the analysis and the patches. I have committed them. -- Robert Haas EDB: http://www.enterprisedb.com
At Thu, 10 Jun 2021 09:56:51 -0400, Robert Haas <robertmhaas@gmail.com> wrote in > On Wed, Jun 9, 2021 at 9:12 PM Kyotaro Horiguchi > <horikyota.ntt@gmail.com> wrote: > > https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=conchuela&dt=2021-06-09%2021%3A12%3A25&stg=recovery-check > > > > > ==~_~===-=-===~_~== pgsql.build/src/test/recovery/tmp_check/log/025_stuck_on_old_timeline_cascade.log ==~_~===-=-===~_~== > > .... > > > 2021-06-09 23:31:10.439 CEST [893820:1] LOG: started streaming WAL from primary at 0/2000000 on timeline 1 > > > 2021-06-09 23:31:10.439 CEST [893820:2] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment000000010000000000000002 has already been removed > > > > The script 025_stuck_on_olde_timeline.pl (and I) forgets to set > > wal_keep_size(segments). > > Thanks for the analysis and the patches. I have committed them. Thanks for committing it. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Kyotaro Horiguchi <horikyota.ntt@gmail.com> writes: > At Thu, 10 Jun 2021 09:56:51 -0400, Robert Haas <robertmhaas@gmail.com> wrote in >> Thanks for the analysis and the patches. I have committed them. > Thanks for committing it. Please note that conchuela and jacana are still failing ... conchuela's failure is evidently not every time, but this test definitely postdates the "fix": https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2021-06-10%2014%3A09%3A08 regards, tom lane
At Thu, 10 Jun 2021 21:53:18 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in > Kyotaro Horiguchi <horikyota.ntt@gmail.com> writes: > > At Thu, 10 Jun 2021 09:56:51 -0400, Robert Haas <robertmhaas@gmail.com> wrote in > >> Thanks for the analysis and the patches. I have committed them. > > > Thanks for committing it. > > Please note that conchuela and jacana are still failing ... > > conchuela's failure is evidently not every time, but this test > definitely postdates the "fix": > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2021-06-10%2014%3A09%3A08 A different test is failing there. Maybe from different issue. > ==~_~===-=-===~_~== pgsql.build/src/test/recovery/tmp_check/log/regress_log_002_archiving ==~_~===-=-===~_~== ... > # Postmaster PID for node "standby2" is 342349 > ### Promoting node "standby2" > # Running: pg_ctl -D /home/pgbf/buildroot/HEAD/pgsql.build/src/test/recovery/tmp_check/t_002_archiving_standby2_data/pgdata-l /home/pgbf/buildroot/HEAD/pgsql.build/src/test/recovery/tmp_check/log/002_archiving_standby2.logpromote > waiting for server to promote................................................................................................................ stoppedwaiting > pg_ctl: server did not promote in time > Bail out! system pg_ctl failed > ==~_~===-=-===~_~== pgsql.build/src/test/recovery/tmp_check/log/002_archiving_standby2.log ==~_~===-=-===~_~== ... > 2021-06-10 16:21:21.870 CEST [342350:9] LOG: received promote request > 2021-06-10 16:21:21.870 CEST [342350:10] LOG: redo done at 0/3030200 system usage: CPU: user: 0.00 s, system: 0.00 s,elapsed: 0.07 s > 2021-06-10 16:21:21.870 CEST [342350:11] LOG: last completed transaction was at log time 2021-06-10 16:21:21.010599+02 > 2021-06-10 16:21:21.893 CEST [342350:12] LOG: restored log file "000000010000000000000003" from archive > cp: /home/pgbf/buildroot/HEAD/pgsql.build/src/test/recovery/tmp_check/t_002_archiving_primary_data/archives/00000003.history: Nosuch file or directory > 2021-06-10 16:21:21.896 CEST [342350:13] LOG: selected new timeline ID: 3 (log ends here) > ==~_~===-=-===~_~== pgsql.build/src/test/recovery/tmp_check/log/002_archiving_primary.log ==~_~===-=-===~_~== ... > 2021-06-10 16:21:21.107 CEST [342322:4] 002_archiving.pl LOG: disconnection: session time: 0:00:00.022 user=pgbf database=postgreshost=[local] > 2021-06-10 16:23:21.965 CEST [342279:4] LOG: received immediate shutdown request So the standby2 was stuck after selecting the new timeline and before updating control file and its postmaster couldn't even respond to SIGQUIT. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
At Fri, 11 Jun 2021 14:07:45 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > At Thu, 10 Jun 2021 21:53:18 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in > > conchuela's failure is evidently not every time, but this test > > definitely postdates the "fix": conchuela failed recovery_check this time, and > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2021-06-10%2014%3A09%3A08 > So the standby2 was stuck after selecting the new timeline and before > updating control file and its postmaster couldn't even respond to > SIGQUIT. https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2021-06-09%2021%3A12%3A25 This is before the "fix" https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2021-06-08%2014%3A07%3A46 failed in pg_verifybackupCheck > ==~_~===-=-===~_~== pgsql.build/src/bin/pg_verifybackup/tmp_check/log/regress_log_003_corruption ==~_~===-=-===~_~== ... > # Failed test 'base backup ok' > # at t/003_corruption.pl line 115. > # Running: pg_verifybackup /home/pgbf/buildroot/HEAD/pgsql.build/src/bin/pg_verifybackup/tmp_check/t_003_corruption_primary_data/backup/open_directory_fails > pg_verifybackup: fatal: could not open file "/home/pgbf/buildroot/HEAD/pgsql.build/src/bin/pg_verifybackup/tmp_check/t_003_corruption_primary_data/backup/open_directory_fails/backup_manifest": Nosuch file or directory > not ok 38 - intact backup verified The manifest file is missing in backup. In this case also the servers failed to handle SIGQUIT. > ==~_~===-=-===~_~== pgsql.build/src/bin/pg_verifybackup/tmp_check/log/003_corruption_primary.log ==~_~===-=-===~_~== ... > 2021-06-08 16:17:41.706 CEST [51792:9] 003_corruption.pl LOG: received replication command: START_REPLICATION SLOT "pg_basebackup_51792"0/B000000 TIMELINE 1 > 2021-06-08 16:17:41.706 CEST [51792:10] 003_corruption.pl STATEMENT: START_REPLICATION SLOT "pg_basebackup_51792" 0/B000000TIMELINE 1 (log ends here) There seems like some hardware failure? regards. -- Kyotaro Horiguchi NTT Open Source Software Center
At Thu, 10 Jun 2021 21:53:18 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in tgl> Please note that conchuela and jacana are still failing ... I forgot jacana's case.. It is failing for the issue the first patch should have fixed. > ==~_~===-=-===~_~== pgsql.build/src/test/recovery/tmp_check/log/025_stuck_on_old_timeline_primary.log ==~_~===-=-===~_~== ... > The system cannot find the path specified. > 2021-06-10 22:56:17.754 EDT [60c2d0cf.54c:1] LOG: archive command failed with exit code 1 > 2021-06-10 22:56:17.754 EDT [60c2d0cf.54c:2] DETAIL: The failed archive command was: /usr/bin/perl "/home/pgrunner/bf/root/HEAD/pgsql/src/test/recovery/t/cp_history_files""pg_wal\\000000010000000000000001" "/home/pgrunner/bf/root/HEAD/pgsql.build/src/test/recovery/tmp_check/t_025_stuck_on_old_timeline_primary_data/archives/000000010000000000000001" the cp_history_files exits with just "exit" for the files with that name, which should set status to 0. ActivePerl did so. If I specified nonexistent command like /hoge/perl, %ERRORLEVEL% is set to 3, not 1. I don't find what is happening there so far. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Fri, Jun 11, 2021 at 11:45 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > At Thu, 10 Jun 2021 21:53:18 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in > tgl> Please note that conchuela and jacana are still failing ... > > I forgot jacana's case.. > > It is failing for the issue the first patch should have fixed. > > > ==~_~===-=-===~_~== pgsql.build/src/test/recovery/tmp_check/log/025_stuck_on_old_timeline_primary.log ==~_~===-=-===~_~== > ... > > The system cannot find the path specified. > > 2021-06-10 22:56:17.754 EDT [60c2d0cf.54c:1] LOG: archive command failed with exit code 1 > > 2021-06-10 22:56:17.754 EDT [60c2d0cf.54c:2] DETAIL: The failed archive command was: /usr/bin/perl "/home/pgrunner/bf/root/HEAD/pgsql/src/test/recovery/t/cp_history_files""pg_wal\\000000010000000000000001" "/home/pgrunner/bf/root/HEAD/pgsql.build/src/test/recovery/tmp_check/t_025_stuck_on_old_timeline_primary_data/archives/000000010000000000000001" Wal file copying will not create a problem for us, but I noticed that it is failing in copying the history files as well and that is creating a problem. 2021-06-10 22:56:28.940 EDT [60c2d0db.1208:1] LOG: archive command failed with exit code 1 2021-06-10 22:56:28.940 EDT [60c2d0db.1208:2] DETAIL: The failed archive command was: /usr/bin/perl "/home/pgrunner/bf/root/HEAD/pgsql/src/test/recovery/t/cp_history_files" "pg_wal\\00000002.history" "/home/pgrunner/bf/root/HEAD/pgsql.build/src/test/recovery/tmp_check/t_025_stuck_on_old_timeline_primary_data/archives/00000002.history" I have noticed that the archive command is failing in some other test case too (002_archiving_standby2.log), see below logs. ==~_~===-=-===~_~== pgsql.build/src/test/recovery/tmp_check/log/002_archiving_standby2.log ==~_~===-=-===~_~== ... 0 file(s) copied. 2021-06-10 22:44:34.467 EDT [60c2ce10.1270:1] LOG: archive command failed with exit code 1 2021-06-10 22:44:34.467 EDT [60c2ce10.1270:2] DETAIL: The failed archive command was: copy "pg_wal\\00000003.history" "c:/mingw/msys/1.0/home/pgrunner/bf/root/HEAD/pgsql.build/src/test/recovery/tmp_check/t_002_archiving_primary_data/archives\\00000003.history" The system cannot find the path specified. 0 file(s) copied. 2021-06-10 22:44:35.478 EDT [60c2ce10.1270:3] LOG: archive command failed with exit code 1 2021-06-10 22:44:35.478 EDT [60c2ce10.1270:4] DETAIL: The failed archive command was: copy "pg_wal\\00000003.history" "c:/mingw/msys/1.0/home/pgrunner/bf/root/HEAD/pgsql.build/src/test/recovery/tmp_check/t_002_archiving_primary_data/archives\\00000003.history" 2021-06-10 22:44:36.113 EDT [60c2ce0c.283c:5] LOG: received immediate shutdown request 2021-06-10 22:44:36.129 EDT [60c2ce0c.283c:6] LOG: database system is shut down I am not able to figure out why the archive command is failing. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Dilip Kumar <dilipbalaut@gmail.com> writes: > On Fri, Jun 11, 2021 at 11:45 AM Kyotaro Horiguchi > <horikyota.ntt@gmail.com> wrote: >>> ==~_~===-=-===~_~== pgsql.build/src/test/recovery/tmp_check/log/025_stuck_on_old_timeline_primary.log ==~_~===-=-===~_~== >>> ... >>> The system cannot find the path specified. >>> 2021-06-10 22:56:17.754 EDT [60c2d0cf.54c:1] LOG: archive command failed with exit code 1 >>> 2021-06-10 22:56:17.754 EDT [60c2d0cf.54c:2] DETAIL: The failed archive command was: /usr/bin/perl "/home/pgrunner/bf/root/HEAD/pgsql/src/test/recovery/t/cp_history_files""pg_wal\\000000010000000000000001" "/home/pgrunner/bf/root/HEAD/pgsql.build/src/test/recovery/tmp_check/t_025_stuck_on_old_timeline_primary_data/archives/000000010000000000000001" > Wal file copying will not create a problem for us, but I noticed that > it is failing in copying the history files as well and that is > creating a problem. I think jacana uses msys[2?], so this likely indicates a problem in path sanitization for the archive command. Andrew, any advice? regards, tom lane
Kyotaro Horiguchi <horikyota.ntt@gmail.com> writes: >> ==~_~===-=-===~_~== pgsql.build/src/bin/pg_verifybackup/tmp_check/log/003_corruption_primary.log ==~_~===-=-===~_~== >> ... >> 2021-06-08 16:17:41.706 CEST [51792:9] 003_corruption.pl LOG: received replication command: START_REPLICATION SLOT "pg_basebackup_51792"0/B000000 TIMELINE 1 >> 2021-06-08 16:17:41.706 CEST [51792:10] 003_corruption.pl STATEMENT: START_REPLICATION SLOT "pg_basebackup_51792" 0/B000000TIMELINE 1 >> (log ends here) > There seems like some hardware failure? conchuela has definitely evinced flakiness before. Not sure what's up with it, but I have no problem with writing off non-repeatable failures from that machine. In any case, it's now passed half a dozen times in a row on HEAD, so I think we can say that it's okay with this test. That leaves jacana, which I'm betting has a Windows portability issue with the new test. regards, tom lane
On Fri, Jun 11, 2021 at 10:46:45AM -0400, Tom Lane wrote: > I think jacana uses msys[2?], so this likely indicates a problem > in path sanitization for the archive command. Andrew, any advice? Err, something around TestLib::perl2host()? -- Michael
Вложения
On 6/12/21 3:48 AM, Michael Paquier wrote: > On Fri, Jun 11, 2021 at 10:46:45AM -0400, Tom Lane wrote: >> I think jacana uses msys[2?], so this likely indicates a problem >> in path sanitization for the archive command. Andrew, any advice? > Err, something around TestLib::perl2host()? I'm working on a fix for this. Yes it includes perl2host, but that's not enough :-) cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
On 6/12/21 7:31 AM, Andrew Dunstan wrote: > On 6/12/21 3:48 AM, Michael Paquier wrote: >> On Fri, Jun 11, 2021 at 10:46:45AM -0400, Tom Lane wrote: >>> I think jacana uses msys[2?], so this likely indicates a problem >>> in path sanitization for the archive command. Andrew, any advice? >> Err, something around TestLib::perl2host()? > > I'm working on a fix for this. Yes it includes perl2host, but that's not > enough :-) > > I have pushed a fix, tested on a replica of fairywren/drongo, cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes: > I have pushed a fix, tested on a replica of fairywren/drongo, This bit seems a bit random: # WAL segment, this is enough to guarantee that the history file was # archived. my $archive_wait_query = - "SELECT '$walfile_to_be_archived' <= last_archived_wal FROM pg_stat_archiver;"; + "SELECT coalesce('$walfile_to_be_archived' <= last_archived_wal, false) " . + "FROM pg_stat_archiver"; $node_standby->poll_query_until('postgres', $archive_wait_query) or die "Timed out while waiting for WAL segment to be archived"; my $last_archived_wal_file = $walfile_to_be_archived; I wonder whether that is a workaround for the poll_query_until bug I proposed to fix at [1]. regards, tom lane [1] https://www.postgresql.org/message-id/2130215.1623450521%40sss.pgh.pa.us
On 6/12/21 10:20 AM, Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: >> I have pushed a fix, tested on a replica of fairywren/drongo, > This bit seems a bit random: > > # WAL segment, this is enough to guarantee that the history file was > # archived. > my $archive_wait_query = > - "SELECT '$walfile_to_be_archived' <= last_archived_wal FROM pg_stat_archiver;"; > + "SELECT coalesce('$walfile_to_be_archived' <= last_archived_wal, false) " . > + "FROM pg_stat_archiver"; > $node_standby->poll_query_until('postgres', $archive_wait_query) > or die "Timed out while waiting for WAL segment to be archived"; > my $last_archived_wal_file = $walfile_to_be_archived; > > I wonder whether that is a workaround for the poll_query_until bug > I proposed to fix at [1]. > > regards, tom lane > > [1] https://www.postgresql.org/message-id/2130215.1623450521%40sss.pgh.pa.us No, it's because I found it annoying and confusing that there was an invisible result when last_archived_wal is null. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes: > On 6/12/21 10:20 AM, Tom Lane wrote: >> I wonder whether that is a workaround for the poll_query_until bug >> I proposed to fix at [1]. > No, it's because I found it annoying and confusing that there was an > invisible result when last_archived_wal is null. OK. But it makes me itch a bit that this one wait-for-wal-to-be- processed query looks different from all the other ones. regards, tom lane
On 6/12/21 1:07 PM, Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: >> On 6/12/21 10:20 AM, Tom Lane wrote: >>> I wonder whether that is a workaround for the poll_query_until bug >>> I proposed to fix at [1]. >> No, it's because I found it annoying and confusing that there was an >> invisible result when last_archived_wal is null. > OK. But it makes me itch a bit that this one wait-for-wal-to-be- > processed query looks different from all the other ones. > > I'm happy to bring the other two queries that look like this into line with this one if you like. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes: > On 6/12/21 1:07 PM, Tom Lane wrote: >> OK. But it makes me itch a bit that this one wait-for-wal-to-be- >> processed query looks different from all the other ones. > I'm happy to bring the other two queries that look like this into line > with this one if you like. I see a lot more than two --- grepping for poll_query_until with a test involving a LSN comparison finds a bunch. Are we sure that there are only three in which the LSN could be null? How much does it really matter if it is? regards, tom lane
On 6/12/21 1:54 PM, Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: >> On 6/12/21 1:07 PM, Tom Lane wrote: >>> OK. But it makes me itch a bit that this one wait-for-wal-to-be- >>> processed query looks different from all the other ones. >> I'm happy to bring the other two queries that look like this into line >> with this one if you like. > I see a lot more than two --- grepping for poll_query_until with > a test involving a LSN comparison finds a bunch. Are we sure that > there are only three in which the LSN could be null? Well, I'm counting places that specifically compare it with pg_stat_archiver.last_archived_wal. > How much > does it really matter if it is? > > It makes it harder to tell if there was any result at all when there's a failure. If it bugs you that much I can revert just that line. Now that I have fixed the immediate issue it matters less. I'm not prepared to put in a lot of effort here, though. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
On 2021-06-10 01:09, Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> Got it. I have now committed the patch to all branches, after adapting >> your changes just a little bit. >> Thanks to you and Kyotaro-san for all the time spent on this. What a slog! > > conchuela failed its first encounter with this test case: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2021-06-09%2021%3A12%3A25 > > That machine has a certain, er, history of flakiness; so this may > not mean anything. Still, we'd better keep an eye out to see if > the test needs more stabilization. Yes, the flakiness is caused by the very weird filesystem (HAMMERFS) that has some weird garbage collection handling that sometimes fills up the disk and then never recovers automatically. I have tried to put in the cleanup-utility for HAMMERFS in cron to run at a schedule but it's isn't 100% fool proof. So I am going to upgrade to a newer version of DragonflyBSD in the near future. /Mikael
On Sat, Jun 12, 2021 at 10:20 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Andrew Dunstan <andrew@dunslane.net> writes: > > I have pushed a fix, tested on a replica of fairywren/drongo, > > This bit seems a bit random: > > # WAL segment, this is enough to guarantee that the history file was > # archived. > my $archive_wait_query = > - "SELECT '$walfile_to_be_archived' <= last_archived_wal FROM pg_stat_archiver;"; > + "SELECT coalesce('$walfile_to_be_archived' <= last_archived_wal, false) " . > + "FROM pg_stat_archiver"; > $node_standby->poll_query_until('postgres', $archive_wait_query) > or die "Timed out while waiting for WAL segment to be archived"; > my $last_archived_wal_file = $walfile_to_be_archived; > > I wonder whether that is a workaround for the poll_query_until bug > I proposed to fix at [1]. I found that a bit random too, but it wasn't the only part of the patch I found a bit random. Like, what can this possibly be doing? +if ($^O eq 'msys') +{ + $perlbin = TestLib::perl2host(dirname($^X)) . '\\' . basename($^X); +} The idea here is apparently that on msys, the directory name that is part of $^X needs to be passed through perl2host, but the file name that is part of the same $^X needs to NOT be passed through perl2host. Is $^X really that broken? If so, I think some comments are in order. +local $ENV{PERL_BADLANG}=0; Similarly here. There's not a single other reference to PERL_BADLANG in the repository, so if we need this one here, there should be a comment explaining why this is different from all the places where we don't need it. On those occasions when I commit TAP test cases, I do try to think about whether they are going to be portable, but I find these kinds of changes indistinguishable from magic. -- Robert Haas EDB: http://www.enterprisedb.com
On 6/14/21 11:52 AM, Robert Haas wrote: > On Sat, Jun 12, 2021 at 10:20 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Andrew Dunstan <andrew@dunslane.net> writes: >>> I have pushed a fix, tested on a replica of fairywren/drongo, >> This bit seems a bit random: >> >> # WAL segment, this is enough to guarantee that the history file was >> # archived. >> my $archive_wait_query = >> - "SELECT '$walfile_to_be_archived' <= last_archived_wal FROM pg_stat_archiver;"; >> + "SELECT coalesce('$walfile_to_be_archived' <= last_archived_wal, false) " . >> + "FROM pg_stat_archiver"; >> $node_standby->poll_query_until('postgres', $archive_wait_query) >> or die "Timed out while waiting for WAL segment to be archived"; >> my $last_archived_wal_file = $walfile_to_be_archived; >> >> I wonder whether that is a workaround for the poll_query_until bug >> I proposed to fix at [1]. This has been reverted. > I found that a bit random too, but it wasn't the only part of the > patch I found a bit random. Like, what can this possibly be doing? > > +if ($^O eq 'msys') > +{ > + $perlbin = TestLib::perl2host(dirname($^X)) . '\\' . basename($^X); > +} > > The idea here is apparently that on msys, the directory name that is > part of $^X needs to be passed through perl2host, but the file name > that is part of the same $^X needs to NOT be passed through perl2host. > Is $^X really that broken? If so, I think some comments are in order. $^X is not at all broken. The explanation here is pretty simple - the argument to perl2host is meant to be a directory. If we're going to accomodate plain files then we have some more work to do in TestLib. > +local $ENV{PERL_BADLANG}=0; > > Similarly here. There's not a single other reference to PERL_BADLANG > in the repository, so if we need this one here, there should be a > comment explaining why this is different from all the places where we > don't need it. Here's why this is different: this is the only place that we invoke the msys perl in this way (i.e. from a non-msys aware environment - the binaries we build are not msys-aware). We need to do that if for no other reason than that it might well be the only perl available. Doing so makes it complain loudly about missing locale info. Setting this variable makes it shut up. I can add a comment on that if you like. > On those occasions when I commit TAP test cases, I do try to think > about whether they are going to be portable, but I find these kinds of > changes indistinguishable from magic. Part of the trouble is that I've been living and breathing some of these issues so much recently that I forget that what might be fairly obvious to me isn't to others. I assure you there is not the faintest touch of pixy dust involved. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
On Mon, Jun 14, 2021 at 12:56 PM Andrew Dunstan <andrew@dunslane.net> wrote: > $^X is not at all broken. > > The explanation here is pretty simple - the argument to perl2host is > meant to be a directory. If we're going to accomodate plain files then > we have some more work to do in TestLib. This explanation seems to contradict the documentation in TestLib.pm, which makes no mention of any such restriction. > > +local $ENV{PERL_BADLANG}=0; > > > > Similarly here. There's not a single other reference to PERL_BADLANG > > in the repository, so if we need this one here, there should be a > > comment explaining why this is different from all the places where we > > don't need it. > > Here's why this is different: this is the only place that we invoke the > msys perl in this way (i.e. from a non-msys aware environment - the > binaries we build are not msys-aware). We need to do that if for no > other reason than that it might well be the only perl available. Doing > so makes it complain loudly about missing locale info. Setting this > variable makes it shut up. I can add a comment on that if you like. Yes, please, but perhaps you'd like to post patches for discussion first instead of committing directly. > Part of the trouble is that I've been living and breathing some of these > issues so much recently that I forget that what might be fairly obvious > to me isn't to others. I assure you there is not the faintest touch of > pixy dust involved. Every pixie with whom I've spoken today says otherwise! -- Robert Haas EDB: http://www.enterprisedb.com
On 6/14/21 1:11 PM, Robert Haas wrote: > On Mon, Jun 14, 2021 at 12:56 PM Andrew Dunstan <andrew@dunslane.net> wrote: >> $^X is not at all broken. >> >> The explanation here is pretty simple - the argument to perl2host is >> meant to be a directory. If we're going to accomodate plain files then >> we have some more work to do in TestLib. > This explanation seems to contradict the documentation in TestLib.pm, > which makes no mention of any such restriction. Heres a snippet: sub perl2host { my ($subject) = @_; ... if (chdir $subject) Last time I looked you can't chdir to anything except a directory. > >>> +local $ENV{PERL_BADLANG}=0; >>> >>> Similarly here. There's not a single other reference to PERL_BADLANG >>> in the repository, so if we need this one here, there should be a >>> comment explaining why this is different from all the places where we >>> don't need it. >> Here's why this is different: this is the only place that we invoke the >> msys perl in this way (i.e. from a non-msys aware environment - the >> binaries we build are not msys-aware). We need to do that if for no >> other reason than that it might well be the only perl available. Doing >> so makes it complain loudly about missing locale info. Setting this >> variable makes it shut up. I can add a comment on that if you like. > Yes, please, but perhaps you'd like to post patches for discussion > first instead of committing directly. I was trying to get the buildfarm green again. There have been plenty of times when small patches directly for such fixes have been committed directly. And that's the only circumstance when I do. cheers andrew
On Mon, Jun 14, 2021 at 1:50 PM Andrew Dunstan <andrew@dunslane.net> wrote: > Heres a snippet: > > sub perl2host > { > my ($subject) = @_; > ... > if (chdir $subject) > > Last time I looked you can't chdir to anything except a directory. OK, but like I said, you can't tell that from the documentation. The documentation says: "Translate a virtual file name to a host file name. Currently, this is a no-op except for the case of Perl=msys and host=mingw32. The subject need not exist, but its parent or grandparent directory must exist unless cygpath is available." If you look just at that, there's nothing that would lead you to believe that it has to be a directory name. > I was trying to get the buildfarm green again. There have been plenty of > times when small patches directly for such fixes have been committed > directly. And that's the only circumstance when I do. I wasn't intending to criticize your work on this. I really appreciate it, in fact, as I also said to you off-list. I do think that there were some small things in those patches where a little bit of quick discussion might have been useful: e.g. should the archive_command change have gone in in the first place? Do we need any comments to explain the fixes? But it's not like it's a big deal either. I'm certainly not disagreeing with the goodness of making the buildfarm green as expediently as possible. -- Robert Haas EDB: http://www.enterprisedb.com
On 6/14/21 1:50 PM, Andrew Dunstan wrote: > On 6/14/21 1:11 PM, Robert Haas wrote: >> On Mon, Jun 14, 2021 at 12:56 PM Andrew Dunstan <andrew@dunslane.net> wrote: >>> $^X is not at all broken. >>> >>> The explanation here is pretty simple - the argument to perl2host is >>> meant to be a directory. If we're going to accomodate plain files then >>> we have some more work to do in TestLib. >> This explanation seems to contradict the documentation in TestLib.pm, >> which makes no mention of any such restriction. > > Heres a snippet: > > > sub perl2host > { > my ($subject) = @_; > ... > if (chdir $subject) > > > Last time I looked you can't chdir to anything except a directory. Actually, I take it back, it does work for a file. I'll change it. I probably did this when something else wasn't working. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
On 6/14/21 3:32 PM, Andrew Dunstan wrote: > On 6/14/21 1:50 PM, Andrew Dunstan wrote: >> On 6/14/21 1:11 PM, Robert Haas wrote: >>> On Mon, Jun 14, 2021 at 12:56 PM Andrew Dunstan <andrew@dunslane.net> wrote: >>>> $^X is not at all broken. >>>> >>>> The explanation here is pretty simple - the argument to perl2host is >>>> meant to be a directory. If we're going to accomodate plain files then >>>> we have some more work to do in TestLib. >>> This explanation seems to contradict the documentation in TestLib.pm, >>> which makes no mention of any such restriction. >> Heres a snippet: >> >> >> sub perl2host >> { >> my ($subject) = @_; >> ... >> if (chdir $subject) >> >> >> Last time I looked you can't chdir to anything except a directory. > > > Actually, I take it back, it does work for a file. I'll change it. I > probably did this when something else wasn't working. So, will you feel happier with this applied? I haven't tested it yet but I'm confident it will work. diff --git a/src/test/recovery/t/025_stuck_on_old_timeline.pl b/src/test/recovery/t/025_stuck_on_old_timeline.pl index e4e58cb8ab..3e19bc4c50 100644 --- a/src/test/recovery/t/025_stuck_on_old_timeline.pl +++ b/src/test/recovery/t/025_stuck_on_old_timeline.pl @@ -24,11 +24,11 @@ my $node_primary = get_new_node('primary'); # the timeline history file reaches the archive but before any of the WAL files # get there. $node_primary->init(allows_streaming => 1, has_archiving => 1); -my $perlbin = $^X; -if ($^O eq 'msys') -{ - $perlbin = TestLib::perl2host(dirname($^X)) . '\\' . basename($^X); -} + +# Note: consistent use of forward slashes here avoids any escaping problems +# that arise from use of backslashes. That means we need to double-quote all +# the paths in the archive_command +my $perlbin = TestLib::perl2host(^X); $perlbin =~ s!\\!/!g if $TestLib::windows_os; my $archivedir_primary = $node_primary->archive_dir; $archivedir_primary =~ s!\\!/!g if $TestLib::windows_os; @@ -36,6 +36,8 @@ $node_primary->append_conf('postgresql.conf', qq( archive_command = '"$perlbin" "$FindBin::RealBin/cp_history_files" "%p" "$archivedir_primary/%f"' wal_keep_size=128MB )); +# make sure that Msys perl doesn't complain about difficulty in setting locale +# when called this way. local $ENV{PERL_BADLANG}=0; $node_primary->start; cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
At Fri, 11 Jun 2021 10:46:45 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in > I think jacana uses msys[2?], so this likely indicates a problem > in path sanitization for the archive command. Andrew, any advice? Thanks for fixing it. # I haven't still succeed to run TAP tests on MSYS2 environment. I # cannot install IPC::Run for msys perl.. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On 6/15/21 2:16 AM, Kyotaro Horiguchi wrote: > At Fri, 11 Jun 2021 10:46:45 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in >> I think jacana uses msys[2?], so this likely indicates a problem >> in path sanitization for the archive command. Andrew, any advice? > Thanks for fixing it. > > # I haven't still succeed to run TAP tests on MSYS2 environment. I > # cannot install IPC::Run for msys perl.. > > regards. > Unpack the attached somewhere and point your PERL5LIB at it. That's all I do. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Вложения
On Mon, Jun 14, 2021 at 3:47 PM Andrew Dunstan <andrew@dunslane.net> wrote: > So, will you feel happier with this applied? I haven't tested it yet but > I'm confident it will work. I'm not all that unhappy now, but yeah, that looks like an improvement to me. I'm still afraid that I will keep writing tests that blow up on Windows but that's a bigger problem than we can hope to fix on this thread, and I do think this discussion has helped. Thanks, -- Robert Haas EDB: http://www.enterprisedb.com
At Tue, 15 Jun 2021 07:54:49 -0400, Andrew Dunstan <andrew@dunslane.net> wrote in > > On 6/15/21 2:16 AM, Kyotaro Horiguchi wrote: > > At Fri, 11 Jun 2021 10:46:45 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in > >> I think jacana uses msys[2?], so this likely indicates a problem > >> in path sanitization for the archive command. Andrew, any advice? > > Thanks for fixing it. > > > > # I haven't still succeed to run TAP tests on MSYS2 environment. I > > # cannot install IPC::Run for msys perl.. > > > > regards. > > > > > Unpack the attached somewhere and point your PERL5LIB at it. That's all > I do. Thanks a lot, Andrew. I get to run the TAP test with it and saw the same error with jacana. regards. -- Kyotaro Horiguchi NTT Open Source Software Center