Re: Fix lag columns in pg_stat_replication not advancing when replay LSN stalls
| From | Chao Li |
|---|---|
| Subject | Re: Fix lag columns in pg_stat_replication not advancing when replay LSN stalls |
| Date | |
| Msg-id | ADD206E1-20CE-46A5-A0E6-4EF763F8E9BF@gmail.com |
| In response to | Fix lag columns in pg_stat_replication not advancing when replay LSN stalls (Fujii Masao <masao.fujii@gmail.com>) |
| Responses | Re: Fix lag columns in pg_stat_replication not advancing when replay LSN stalls |
| List | pgsql-hackers |
> On Oct 17, 2025, at 11:56, Fujii Masao <masao.fujii@gmail.com> wrote:
>
> Hi,
>
> While testing, I noticed that write_lag and flush_lag in pg_stat_replication
> initially advanced but eventually stopped updating. This happened when
> I started pg_receivewal, ran pgbench, and periodically monitored
> pg_stat_replication.
>
> My analysis shows that this issue occurs when any of the write, flush,
> or replay LSNs in the standby’s feedback message stop updating for some time.
> In the case of pg_receivewal, the replay LSN is always invalid (never updated),
> which triggers the problem. Similarly, in regular streaming replication,
> if the replay LSN remains unchanged for a long time, such as during
> a recovery conflict, the lag values for both write and flush can stop advancing.
>
> The root cause seems to be that when any of the LSNs stop updating,
> the lag tracker's cyclic buffer becomes full (the write head reaches
> the slowest read head). In this situation, LagTrackerWrite() and
> LagTrackerRead() didn't handle the full-buffer condition properly.
> For instance, if the replay LSN stalls, the buffer fills up and the read heads
> for "write" and "flush" end up at the same position as the write head.
> This causes LagTrackerRead() to return -1 for both, preventing write_lag
> and flush_lag from advancing.
>
> The attached patch fixes the problem by treating the slowest read entry
> (the one causing the buffer to fill up) as a separate overflow entry,
> allowing the lag tracker to continue operating correctly.
>
> --
> Fujii Masao
> <v1-0001-Fix-lag-columns-in-pg_stat_replication-not-advanc.patch>

It took me some time to understand this fix. What confused me most was how a reader head can catch up again once an overwrite has happened.
Finally I figured it out:

```
+ lag_tracker->read_heads[head] =
+     (lag_tracker->write_head + 1) % LAG_TRACKER_BUFFER_SIZE;
```

"(lag_tracker->write_head + 1) % LAG_TRACKER_BUFFER_SIZE" points to the oldest LSN in the ring, from where an overflowed reader head starts to catch up.

I have no comments on the code change. Nice patch! My only question is whether we can add a TAP test for this fix.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
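
The catch-up step discussed in the reply above can be illustrated with a toy ring buffer. This is only a sketch under stated assumptions, not the patch or PostgreSQL's lag tracker: the `Ring` type, `ring_init`, `ring_write`, `ring_read`, `RING_SIZE`, `NUM_READERS`, and `OVERFLOWED` are all hypothetical names, the samples are plain integers rather than LSN/timestamp pairs, and the real `LagTrackerRead()` decides what to consume by comparing LSNs. The only part taken from the thread is the idea that a reader pushed out of the ring is kept as a separate overflow entry and later rejoins at `(write_head + 1) % size`.

```c
#include <assert.h>
#include <string.h>

#define RING_SIZE   4      /* hypothetical; plays the role of LAG_TRACKER_BUFFER_SIZE */
#define NUM_READERS 2      /* hypothetical number of read heads */
#define OVERFLOWED  (-1)   /* marker for a reader that was pushed out of the ring */

typedef struct Ring
{
    long slots[RING_SIZE];
    int  write_head;                  /* next slot to be written */
    int  read_heads[NUM_READERS];     /* next slot each reader consumes, or OVERFLOWED */
    long overflow_value[NUM_READERS]; /* sample saved when a reader was pushed out */
} Ring;

static void
ring_init(Ring *r)
{
    memset(r, 0, sizeof(*r));
}

/* Append a sample; a reader blocking the next slot is moved to its overflow entry. */
static void
ring_write(Ring *r, long value)
{
    int next = (r->write_head + 1) % RING_SIZE;

    for (int i = 0; i < NUM_READERS; i++)
    {
        if (r->read_heads[i] == next)
        {
            /* Save the sample this reader was waiting on, then detach it. */
            r->overflow_value[i] = r->slots[next];
            r->read_heads[i] = OVERFLOWED;
        }
    }
    r->slots[r->write_head] = value;
    r->write_head = next;
}

/* Consume one sample for a reader; returns 1 on success, 0 if none is pending. */
static int
ring_read(Ring *r, int reader, long *out)
{
    assert(reader >= 0 && reader < NUM_READERS);

    if (r->read_heads[reader] == OVERFLOWED)
    {
        /* Hand back the saved sample, then rejoin at the oldest live slot. */
        *out = r->overflow_value[reader];
        r->read_heads[reader] = (r->write_head + 1) % RING_SIZE;
        return 1;
    }
    if (r->read_heads[reader] == r->write_head)
        return 0;               /* caught up with the writer */

    *out = r->slots[r->read_heads[reader]];
    r->read_heads[reader] = (r->read_heads[reader] + 1) % RING_SIZE;
    return 1;
}
```

In this toy model, writing 10 through 60 while reader 0 keeps up and reader 1 stalls pushes reader 1 out of the ring at the fourth write. When reader 1 resumes, it first receives the saved sample 10 and then continues with 40, 50, and 60; the samples 20 and 30 are skipped, and the writer was never blocked.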