Обсуждение: standby apply lag on inactive servers
A customer of ours complained that if you have an inactive primary, monitoring the apply lag on a standby reports monotonically increasing lag. The reason for this is that the apply lag is only updated on COMMIT records, which of course don't occur in inactive servers. But CHECKPOINT records do occur, so the WAL insert pointer continues to move forward, which is what causes the spurious lag. (I think newer releases are protected from this problem because they don't emit checkpoints during periods of inactivity. I didn't verify this.) This patch fixes the problem by using the checkpoint timestamp to update the lag tracker in the standby. This requires a little change in where this update is invoked, because previously it was done only for the XACT rmgr; this makes the patch a little bigger than it should. -- Álvaro Herrera PostgreSQL Expert, https://www.2ndQuadrant.com/
Вложения
On 2020-Jan-10, Alvaro Herrera wrote: > A customer of ours complained that if you have an inactive primary, > monitoring the apply lag on a standby reports monotonically increasing > lag. The reason for this is that the apply lag is only updated on > COMMIT records, which of course don't occur in inactive servers. > But CHECKPOINT records do occur, so the WAL insert pointer continues to > move forward, which is what causes the spurious lag. > > (I think newer releases are protected from this problem because they > don't emit checkpoints during periods of inactivity. I didn't verify > this.) > > This patch fixes the problem by using the checkpoint timestamp to update > the lag tracker in the standby. This requires a little change in where > this update is invoked, because previously it was done only for the XACT > rmgr; this makes the patch a little bigger than it should. Here's a version of the patch that applies to current master. It does fix the problem that CHECKPOINT wal records are not considered when determining time-of-latest-record. However, it does *not* fix the monitoring problem I mentioned (which relied on comparing pg_last_xact_replay_timestamp() to 'now()') ... because commit 6ef2eba3f57f (pg10) made an idle server not emit checkpoint records anymore. That is, my parenthical remark was completely wrong: the new versions not only are "protected", but also this fix doesn't fix them. Luckily, the way to fix monitoring for servers of versions 10 and later is to use the new replay_lag (etc) columns in pg_stat_replication, commit 6912acc04f0b (also pg10). I am inclined to apply this to all branches unless there are strong objections, because the current code seems pretty arbitrary anyway. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Вложения
Actually looking again, getRecordTimestamp is looking pretty strange. It looks much more natural by using nested switch/case blocks, as with this diff. I think the compiler does a better job this way too. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hello. At Mon, 27 Jan 2020 18:06:27 -0300, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in > Actually looking again, getRecordTimestamp is looking pretty strange. > It looks much more natural by using nested switch/case blocks, as with > this diff. I think the compiler does a better job this way too. Agreed. Anyway I looked the latest version. The aim of the patch seem reasonable. XLOG_END_OF_RECOVERY and XLOG_XACT_PREPARE also have a timestamp but it doesn't help much. (But could be included for consistency.) The timestamp of a checkpoint record is the start time of a checkpoint (and doesn't have subseconds part, but it's not a problem.). That means that the timestamp is usually far behind the time at the record has been inserted. As the result the stored timestamp can go back by a significant internval. I don't think that causes an actual problem but the movement looks wierd as the result of pg_last_xact_replay_timestamp(). Asides from the backward movement, a timestamp from other than commit/abort records in recvoeryLastXTime affects the following code. xlog.c:7329: (similar code exists at line 9332) > ereport(LOG, > (errmsg("redo done at %X/%X", > (uint32) (ReadRecPtr >> 32), (uint32) ReadRecPtr))); > xtime = GetLatestXTime(); > if (xtime) > ereport(LOG, > (errmsg("last completed transaction was at log time %s", > timestamptz_to_str(xtime)))); This code assumes (and the name GetLatestXTime() suggests, I first noticed that here..) that the timestamp comes from commit/abort logs, so otherwise it shows a wrong timestamp. We shouldn't update the variable by other than that kind of records. If (I don't think that comes true..) we set the timestamp from other than that kind of record, the names and the comments of the functions should be changed. > /* > * Save timestamp of latest processed commit/abort record. > * > * We keep this in XLogCtl, not a simple static variable, so that it can be > * seen by processes other than the startup process. Note in particular > * that CreateRestartPoint is executed in the checkpointer. > */ > static void > SetLatestXTime(TimestampTz xtime) ... > /* > * Fetch timestamp of latest processed commit/abort record. > */ > TimestampTz > GetLatestXTime(void) regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On 2020-Jan-28, Kyotaro Horiguchi wrote: > At Mon, 27 Jan 2020 18:06:27 -0300, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in > > Actually looking again, getRecordTimestamp is looking pretty strange. > > It looks much more natural by using nested switch/case blocks, as with > > this diff. I think the compiler does a better job this way too. > > Agreed. Anyway I looked the latest version. > > The aim of the patch seem reasonable. XLOG_END_OF_RECOVERY and > XLOG_XACT_PREPARE also have a timestamp but it doesn't help much. (But > could be included for consistency.) Hello, thanks for looking. > The timestamp of a checkpoint record is the start time of a checkpoint > (and doesn't have subseconds part, but it's not a problem.). That > means that the timestamp is usually far behind the time at the record > has been inserted. As the result the stored timestamp can go back by a > significant internval. I don't think that causes an actual problem but > the movement looks wierd as the result of > pg_last_xact_replay_timestamp(). Ouch ... yeah, it should be set only if it doesn't go backwards. > xlog.c:7329: (similar code exists at line 9332) > > ereport(LOG, > > (errmsg("redo done at %X/%X", > > (uint32) (ReadRecPtr >> 32), (uint32) ReadRecPtr))); > > xtime = GetLatestXTime(); > > if (xtime) > > ereport(LOG, > > (errmsg("last completed transaction was at log time %s", > > timestamptz_to_str(xtime)))); > > This code assumes (and the name GetLatestXTime() suggests, I first > noticed that here..) that the timestamp comes from commit/abort logs, > so otherwise it shows a wrong timestamp. We shouldn't update the > variable by other than that kind of records. Hmm, that's terrible. GetLatestXTime() being displayed user-visibly for "last transaction completion" but having it include unrelated things such as restore points is terrible. One idea is to should split it in two: one exclusively for transaction commit/abort, and another for all WAL activity. That way, the former can be used for that message, and the latter for standby replay reports. However, that might be overengineering, if the only thing that the former is that one LOG message; instead changing the log message to state that the time is for other activity, as you suggest, is simpler and has no downside that I can see. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2020-Jan-27, Alvaro Herrera wrote: > Actually looking again, getRecordTimestamp is looking pretty strange. > It looks much more natural by using nested switch/case blocks, as with > this diff. I think the compiler does a better job this way too. I hadn't noticed I forgot to attach the diff here :-( -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Вложения
At Tue, 28 Jan 2020 11:18:50 -0300, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in > On 2020-Jan-27, Alvaro Herrera wrote: > > > Actually looking again, getRecordTimestamp is looking pretty strange. > > It looks much more natural by using nested switch/case blocks, as with > > this diff. I think the compiler does a better job this way too. > > I hadn't noticed I forgot to attach the diff here :-( Yeay, that patch bases the apply-lag patch:) And contains XLOG_CHECKPOINT_*. But otherwise looks good to me. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
At Tue, 28 Jan 2020 11:18:14 -0300, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in > > xlog.c:7329: (similar code exists at line 9332) > > > ereport(LOG, > > > (errmsg("redo done at %X/%X", > > > (uint32) (ReadRecPtr >> 32), (uint32) ReadRecPtr))); > > > xtime = GetLatestXTime(); > > > if (xtime) > > > ereport(LOG, > > > (errmsg("last completed transaction was at log time %s", > > > timestamptz_to_str(xtime)))); > > > > This code assumes (and the name GetLatestXTime() suggests, I first > > noticed that here..) that the timestamp comes from commit/abort logs, > > so otherwise it shows a wrong timestamp. We shouldn't update the > > variable by other than that kind of records. > > Hmm, that's terrible. GetLatestXTime() being displayed user-visibly for > "last transaction completion" but having it include unrelated things > such as restore points is terrible. One idea is to should split it in > two: one exclusively for transaction commit/abort, and another for all > WAL activity. That way, the former can be used for that message, and > the latter for standby replay reports. However, that might be > overengineering, if the only thing that the former is that one LOG > message; instead changing the log message to state that the time is for > other activity, as you suggest, is simpler and has no downside that I > can see. Perhaps we can use ControlData->checkPointCopy.time instead. It misses checkpoint records intermittently but works in general. But as more significant issue, nowadays PostgreSQL doesn't run a checkpoint if it is really inactive (that is, if no "important" WAL records have issued.). regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On 2020-Jan-29, Kyotaro Horiguchi wrote: > But as more significant issue, nowadays PostgreSQL doesn't run a > checkpoint if it is really inactive (that is, if no "important" WAL > records have issued.). Yeah, I mentioned this in message <20200127203419.GA15216@alvherre.pgsql>. The solution for monitoring purposes is to use the new "lag" columns in pg_stat_replication. But that's not available in older releases (prior to 10), so this change is still useful. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2020-Jan-28, Kyotaro Horiguchi wrote: > The aim of the patch seem reasonable. XLOG_END_OF_RECOVERY and > XLOG_XACT_PREPARE also have a timestamp but it doesn't help much. (But > could be included for consistency.) Hmm I think I should definitely include those. > The timestamp of a checkpoint record is the start time of a checkpoint > (and doesn't have subseconds part, but it's not a problem.). That > means that the timestamp is usually far behind the time at the record > has been inserted. As the result the stored timestamp can go back by a > significant internval. I don't think that causes an actual problem but > the movement looks wierd as the result of > pg_last_xact_replay_timestamp(). A problem I see with this is that setting the timestamp is done with XLogCtl->lock held; since now we need to make the update conditional, we'd have to read the current value, compare with the checkpoint time, then set, all with the spinlock held. That seems way too expensive. A compromise might be to do the compare only when it's done for checkpoint. These occur seldom enough that it shouldn't be a problem (as opposed to commit records, which can be very frequent). > Asides from the backward movement, a timestamp from other than > commit/abort records in recvoeryLastXTime affects the following code. > > xlog.c:7329: (similar code exists at line 9332) > > ereport(LOG, > > (errmsg("redo done at %X/%X", > > (uint32) (ReadRecPtr >> 32), (uint32) ReadRecPtr))); > > xtime = GetLatestXTime(); > > if (xtime) > > ereport(LOG, > > (errmsg("last completed transaction was at log time %s", > > timestamptz_to_str(xtime)))); > > This code assumes (and the name GetLatestXTime() suggests, I first > noticed that here..) that the timestamp comes from commit/abort logs, > so otherwise it shows a wrong timestamp. We shouldn't update the > variable by other than that kind of records. Thinking about this some more, I think we should do keep the message the same backbranches (avoid breaking anything that might be reading the log -- a remote but not inexistent possibility), and adjust the message in master to be something like "last timestamped WAL activity at time %s", and document that it means commit, abort, restore label, checkpoint. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
At Wed, 29 Jan 2020 19:11:31 -0300, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in > On 2020-Jan-28, Kyotaro Horiguchi wrote: > > > The aim of the patch seem reasonable. XLOG_END_OF_RECOVERY and > > XLOG_XACT_PREPARE also have a timestamp but it doesn't help much. (But > > could be included for consistency.) > > Hmm I think I should definitely include those. I agree to that, given the following change of log messages. > > The timestamp of a checkpoint record is the start time of a checkpoint > > (and doesn't have subseconds part, but it's not a problem.). That > > means that the timestamp is usually far behind the time at the record > > has been inserted. As the result the stored timestamp can go back by a > > significant internval. I don't think that causes an actual problem but > > the movement looks wierd as the result of > > pg_last_xact_replay_timestamp(). > > A problem I see with this is that setting the timestamp is done with > XLogCtl->lock held; since now we need to make the update conditional, > we'd have to read the current value, compare with the checkpoint time, > then set, all with the spinlock held. That seems way too expensive. > > A compromise might be to do the compare only when it's done for > checkpoint. These occur seldom enough that it shouldn't be a problem > (as opposed to commit records, which can be very frequent). I think we don't need to that, given the following change. > > Asides from the backward movement, a timestamp from other than > > commit/abort records in recvoeryLastXTime affects the following code. > > > > xlog.c:7329: (similar code exists at line 9332) > > > ereport(LOG, > > > (errmsg("redo done at %X/%X", > > > (uint32) (ReadRecPtr >> 32), (uint32) ReadRecPtr))); > > > xtime = GetLatestXTime(); > > > if (xtime) > > > ereport(LOG, > > > (errmsg("last completed transaction was at log time %s", > > > timestamptz_to_str(xtime)))); > > > > This code assumes (and the name GetLatestXTime() suggests, I first > > noticed that here..) that the timestamp comes from commit/abort logs, > > so otherwise it shows a wrong timestamp. We shouldn't update the > > variable by other than that kind of records. > > Thinking about this some more, I think we should do keep the message the > same backbranches (avoid breaking anything that might be reading the log > -- a remote but not inexistent possibility), and adjust the message in > master to be something like "last timestamped WAL activity at time %s", > and document that it means commit, abort, restore label, checkpoint. Agreed about backbranches. I'd like to preserve the word "transaction" as it is more familiar to users. How about something like the follows? "transactions are completed up to log time %s" regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On 2020-Jan-30, Kyotaro Horiguchi wrote: > Agreed about backbranches. I'd like to preserve the word "transaction" > as it is more familiar to users. How about something like the follows? > > "transactions are completed up to log time %s" That's a good point. I used the phrase "transaction activity", which seems sufficiently explicit to me. So, the attached is the one for master; in back branches I would use the same (plus minor conflict fixes), except that I would drop the message wording changes. Thanks for the reviews so far, -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Вложения
At Thu, 30 Jan 2020 17:45:36 -0300, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in > On 2020-Jan-30, Kyotaro Horiguchi wrote: > > > Agreed about backbranches. I'd like to preserve the word "transaction" > > as it is more familiar to users. How about something like the follows? > > > > "transactions are completed up to log time %s" > > That's a good point. I used the phrase "transaction activity", which > seems sufficiently explicit to me. > > So, the attached is the one for master; in back branches I would use the > same (plus minor conflict fixes), except that I would drop the message > wording changes. > > Thanks for the reviews so far, My pleasure. regads. -- Kyotaro Horiguchi NTT Open Source Software Center
On 2020/01/31 5:45, Alvaro Herrera wrote: > On 2020-Jan-30, Kyotaro Horiguchi wrote: > >> Agreed about backbranches. I'd like to preserve the word "transaction" >> as it is more familiar to users. How about something like the follows? >> >> "transactions are completed up to log time %s" > > That's a good point. I used the phrase "transaction activity", which > seems sufficiently explicit to me. > > So, the attached is the one for master; in back branches I would use the > same (plus minor conflict fixes), except that I would drop the message > wording changes. You're thinking to apply this change to the back branches? Sorry if my understanding is not right. But I don't think that back-patch is ok because it changes the documented existing behavior of pg_last_xact_replay_timestamp(). So it looks like the behavior change not a bug fix. Regards, -- Fujii Masao NTT DATA CORPORATION Advanced Platform Technology Group Research and Development Headquarters
On 2020-Jan-31, Fujii Masao wrote: > You're thinking to apply this change to the back branches? Sorry > if my understanding is not right. But I don't think that back-patch > is ok because it changes the documented existing behavior > of pg_last_xact_replay_timestamp(). So it looks like the behavior > change not a bug fix. Yeah, I am thinking in backpatching it. The documented behavior is already not what the code does. Do you have a situation where this change would break something? If so, can you please explain what it is? I think (and I said it upthread) a 100% complete fix involves tracking two timestamps rather than one. I was thinking that that would be too invasive because it changes XLogCtlData shmem struct ... but that struct is private to xlog.c, so I think it's fine to change the struct. The problem though is that the user-visible change that I want to achieve is pg_last_xact_replay_timestamp(), and it would be obviously wrong to use the new XLogCtlData field rather than the existing one, as that would be a behavior change in the same sense that you're now complaining about. So I would achieve nothing. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2020/01/31 22:40, Alvaro Herrera wrote: > On 2020-Jan-31, Fujii Masao wrote: > >> You're thinking to apply this change to the back branches? Sorry >> if my understanding is not right. But I don't think that back-patch >> is ok because it changes the documented existing behavior >> of pg_last_xact_replay_timestamp(). So it looks like the behavior >> change not a bug fix. > > Yeah, I am thinking in backpatching it. The documented behavior is > already not what the code does. Maybe you thought this because getRecordTimestamp() extracts the timestamp from even WAL record of a restore point? That is, you're concerned about that pg_last_xact_replay_timestamp() returns the timestamp of not only commit/abort record but also restore point one. Right? As far as I read the code, this problem doesn't occur because SetLatestXTime() is called only for commit/abort records, in recoveryStopsAfter(). No? > Do you have a situation where this > change would break something? If so, can you please explain what it is? For example, use the return value of pg_last_xact_replay_timestamp() (and also the timestamp in the log message output at the end of recovery) as a HINT when setting recovery_target_time later. Use it to compare with the timestamp retrieved from the master server, in order to monitor the replication delay. Regards, -- Fujii Masao NTT DATA CORPORATION Advanced Platform Technology Group Research and Development Headquarters
On 2020-Jan-31, Fujii Masao wrote: > On 2020/01/31 22:40, Alvaro Herrera wrote: > > On 2020-Jan-31, Fujii Masao wrote: > > > > > You're thinking to apply this change to the back branches? Sorry > > > if my understanding is not right. But I don't think that back-patch > > > is ok because it changes the documented existing behavior > > > of pg_last_xact_replay_timestamp(). So it looks like the behavior > > > change not a bug fix. > > > > Yeah, I am thinking in backpatching it. The documented behavior is > > already not what the code does. > > Maybe you thought this because getRecordTimestamp() extracts the > timestamp from even WAL record of a restore point? That is, you're > concerned about that pg_last_xact_replay_timestamp() returns the > timestamp of not only commit/abort record but also restore point one. > Right? right. > As far as I read the code, this problem doesn't occur because > SetLatestXTime() is called only for commit/abort records, in > recoveryStopsAfter(). No? ... uh, wow, you're right about that too. IMO this is extremely fragile, easy to break, and under-documented. But you're right, there's no bug there at present. > > Do you have a situation where this > > change would break something? If so, can you please explain what it is? > > For example, use the return value of pg_last_xact_replay_timestamp() > (and also the timestamp in the log message output at the end of > recovery) as a HINT when setting recovery_target_time later. Hmm. I'm not sure how you would use it in that way. I mean, I understand how it *can* be used that way, but it seems too fragile to be done in practice, in a scenario that's not just laboratory games. > Use it to compare with the timestamp retrieved from the master server, > in order to monitor the replication delay. That's precisely the use case that I'm aiming at. The timestamp currently is not useful because this usage breaks when the primary is inactive (no COMMIT records occur). During such periods of inactivity, CHECKPOINT records would keep the "last xtime" current. This has actually happened in a production setting, it's not a thought experiment. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2020/01/31 23:47, Alvaro Herrera wrote: > On 2020-Jan-31, Fujii Masao wrote: >> On 2020/01/31 22:40, Alvaro Herrera wrote: >>> On 2020-Jan-31, Fujii Masao wrote: >>> >>>> You're thinking to apply this change to the back branches? Sorry >>>> if my understanding is not right. But I don't think that back-patch >>>> is ok because it changes the documented existing behavior >>>> of pg_last_xact_replay_timestamp(). So it looks like the behavior >>>> change not a bug fix. >>> >>> Yeah, I am thinking in backpatching it. The documented behavior is >>> already not what the code does. >> >> Maybe you thought this because getRecordTimestamp() extracts the >> timestamp from even WAL record of a restore point? That is, you're >> concerned about that pg_last_xact_replay_timestamp() returns the >> timestamp of not only commit/abort record but also restore point one. >> Right? > > right. > >> As far as I read the code, this problem doesn't occur because >> SetLatestXTime() is called only for commit/abort records, in >> recoveryStopsAfter(). No? > > ... uh, wow, you're right about that too. IMO this is extremely > fragile, easy to break, and under-documented. Yeah, it's worth improving the code. > But you're right, there's > no bug there at present. > >>> Do you have a situation where this >>> change would break something? If so, can you please explain what it is? >> >> For example, use the return value of pg_last_xact_replay_timestamp() >> (and also the timestamp in the log message output at the end of >> recovery) as a HINT when setting recovery_target_time later. > > Hmm. > > I'm not sure how you would use it in that way. I mean, I understand how > it *can* be used that way, but it seems too fragile to be done in > practice, in a scenario that's not just laboratory games. > >> Use it to compare with the timestamp retrieved from the master server, >> in order to monitor the replication delay. > > That's precisely the use case that I'm aiming at. The timestamp > currently is not useful because this usage breaks when the primary is > inactive (no COMMIT records occur). During such periods of inactivity, > CHECKPOINT records would keep the "last xtime" current. This has > actually happened in a production setting, it's not a thought > experiment. I've heard that someone periodically generates dummy tiny transactions (say, every minute), as a band-aid solution, to avoid inactive primary. Of course, this is not a perfect solution. The idea that I proposed previously was to introduce pg_last_xact_insert_timestamp() [1] into core. This function returns the timestamp of commit / abort records in *primary* side. So we can retrieve that timestamp from the primary (e.g., by using dblink) and compare its result with pg_last_xact_replay_timestamp() to calculate the delay in the standby. Another idea is to include the commit / abort timestamp in primary-keepalive-message that periodially sent from the primary to the standby. Then if we introduce the function returning that timestamp, in the standby side, we can easily compare the commit / abort timestamps taken from both primary and standby, in the standby. [1] https://www.postgresql.org/message-id/CAHGQGwF3ZjfuNEj5ka683KU5rQUBtSWtqFq7g1X0g34o+JXWBw@mail.gmail.com Regards, -- Fujii Masao NTT DATA CORPORATION Advanced Platform Technology Group Research and Development Headquarters