When a cascading standby launches a new walsender, it fetches the
current recovery timeline:

    /*
     * Use the recovery target timeline ID during recovery
     */
    if (am_cascading_walsender)
        ThisTimeLineID = GetRecoveryTargetTLI();

GetRecoveryTargetTLI() does this:

    /* RecoveryTargetTLI doesn't change so we need no lock to copy it */
    return XLogCtl->RecoveryTargetTLI;

That comment is not true. RecoveryTargetTLI can change during recovery,
if you set recovery_target_timeline='latest'. In 'latest' mode, when the
(apparent) end of WAL is reached, the archive is scanned for any new
timeline history files that may have appeared. If a new timeline is
found, RecoveryTargetTLI is updated, and recovery is continued on the
new timeline.
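
For illustration only, here is a minimal standalone sketch of what reading the value under a lock could look like. It is not the PostgreSQL code: the struct, the toy_* function names, and the C11 atomic_flag used as a stand-in for a spinlock are all assumptions made so that the example compiles on its own.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

typedef uint32_t TimeLineID;

/* Hypothetical stand-in for the shared control struct; not the real layout. */
typedef struct
{
    atomic_flag info_lck;            /* toy spinlock protecting the field below */
    TimeLineID  RecoveryTargetTLI;
} ToyXLogCtlData;

static ToyXLogCtlData ToyXLogCtl = { ATOMIC_FLAG_INIT, 1 };

static void
toy_lock(atomic_flag *lck)
{
    /* Spin until the flag was previously clear, i.e. we now own the lock. */
    while (atomic_flag_test_and_set_explicit(lck, memory_order_acquire))
        ;
}

static void
toy_unlock(atomic_flag *lck)
{
    atomic_flag_clear_explicit(lck, memory_order_release);
}

/* Writer side: what the startup process would do when it finds a new timeline. */
void
toy_set_recovery_target_tli(TimeLineID newtli)
{
    assert(newtli != 0);             /* assumption: 0 is never a valid timeline */

    toy_lock(&ToyXLogCtl.info_lck);
    ToyXLogCtl.RecoveryTargetTLI = newtli;
    toy_unlock(&ToyXLogCtl.info_lck);
}

/* Reader side: copy the value while holding the lock, then return the copy. */
TimeLineID
toy_get_recovery_target_tli(void)
{
    TimeLineID result;

    toy_lock(&ToyXLogCtl.info_lck);
    result = ToyXLogCtl.RecoveryTargetTLI;
    toy_unlock(&ToyXLogCtl.info_lck);

    return result;
}
```

Note that a locked read only guarantees a consistent copy at the moment of the call. It does nothing about a caller that keeps using the copy after the value has changed.
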

Aside from the missing locking, I wonder what that does to a cascaded
standby. If there is an active walsender running while RecoveryTargetTLI
is changed, I think what will happen is that the walsender will continue
to stream WAL from the old timeline, but because the startup process is
now actually replaying from a different timeline, the walsender will
send bogus WAL to the standby.
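
To make the scenario concrete, here is a toy model of it. Everything in it is hypothetical (the ToyWalSender struct, the toy_* functions, and the single global standing in for the timeline the startup process is replaying). It only shows how a timeline ID captured once at walsender start goes stale after a switch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TimeLineID;

/* Hypothetical model of a walsender's private state. */
typedef struct
{
    TimeLineID sendTLI;              /* timeline captured when the walsender started */
} ToyWalSender;

/* Hypothetical: the timeline the startup process is currently replaying. */
static TimeLineID toy_replay_tli = 1;

/* The walsender fetches the timeline once, at launch. */
void
toy_walsender_start(ToyWalSender *ws)
{
    assert(ws != NULL);
    ws->sendTLI = toy_replay_tli;
}

/* The startup process finds a new timeline and continues recovery on it. */
void
toy_switch_timeline(TimeLineID newtli)
{
    toy_replay_tli = newtli;
}

/* True if the walsender would still be streaming from the old timeline. */
bool
toy_walsender_is_stale(const ToyWalSender *ws)
{
    assert(ws != NULL);
    return ws->sendTLI != toy_replay_tli;
}
```

In this model, a walsender started on timeline 1 reports itself stale as soon as toy_switch_timeline(2) is called, yet its sendTLI is still 1. Nothing in the model stops it from sending in that state.
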

When a standby ends recovery, creates a new timeline, and switches to
normal operation, the postmaster terminates all walsenders because of the
timeline change. But don't we have a race condition there, with a similar
effect? It might take a while for a walsender to die, and in that
window, it might send bogus WAL to the cascaded standby.

- Heikki