RE: Lock timeouts and unusual spikes in replication lag with logical parallel transaction streaming
От | Zhijie Hou (Fujitsu) |
---|---|
Тема | RE: Lock timeouts and unusual spikes in replication lag with logical parallel transaction streaming |
Дата | |
Msg-id | TY4PR01MB16907BBAFB73A4022CB4121089431A@TY4PR01MB16907.jpnprd01.prod.outlook.com обсуждение исходный текст |
Ответ на | RE: Lock timeouts and unusual spikes in replication lag with logical parallel transaction streaming ("Zhijie Hou (Fujitsu)" <houzj.fnst@fujitsu.com>) |
Ответы |
Re: Lock timeouts and unusual spikes in replication lag with logical parallel transaction streaming
|
Список | pgsql-bugs |
> On Monday, August 18, 2025 4:12 PM Zane Duffield > <duffieldzane@gmail.com> wrote: > > On Mon, Aug 11, 2025 at 9:28 PM Zhijie Hou (Fujitsu) > > <mailto:houzj.fnst@fujitsu.com> wrote: > > > Thank you for reporting the issue. I reviewed the logs and query > > > results but did not find any evidence of a suspicious strong lock. > > > Therefore, if possible, could you please query pg_lock and pg_stat_activity > separately? > > > This will allow us to determine whether there are any other wait > > > events, besides strong locks, that might be causing blocking. > > > > We tried reproducing the issue in a test environment and were not > > completely successful; we can reproduce the unusual spikes in > > replication lag (which correspond to restarts of the apply worker), > > but we cannot reproduce the stalled replication that we saw in production > once. > > > > I suspect that these issues are related, and that the stalling is > > simply a more severe case of the lagging, so I hope that the extra > > logs I have collected will help diagnose the problem. Attached is a > > tarball (~17MB, with file extension _tgz because gmail wouldn't let me > upload a .tgz) which contains a few files: > > 1. confirmed_flush_lsn_lag_graph_2025_08_09.png: a graph of the logical > > replication lag during the original incident, where replication falls behind > > and does not recover until we intervene and alter the subscription to > disable > > parallel apply > > 2. confirmed_flush_lsn_lag_graph_2025_08_13.png: the same graph, but > during an > > attempt to recreate the issue > > 3. server.log: the subscriber's postgres server log during the attempt to > > recreate the issue > > 4. pg_stat_activity.log: a log of a periodic query of all of pg_stat_activity > > during the attempt to recreate the issue 5. pg_locks.log: a log of > > a periodic query of all of pg_locks during the attempt to recreate the > > issue Please note that the graphs are in UTC, but the logs are in > > UTC+1. Also note that some information has been redacted (server > > names, database identifiers, IP addresses). > > > > Another thing worth noting is that in the first graph, the blue and > > yellow lines correspond to two other subscribers that were configured with > "streaming = off". > > You can clearly see that for the period where the green line > > ("streaming = > > parallel") is going up and down, the other subscribers have a > > practically flat line, only going up at the end before returning to zero. > > Thanks for sharing the logs! > > I noticed in the logs[1] that both the parallel apply worker and leader apply > worker are restarting at certain intervals, which could be the cause of the lag. > This might be due to a SIGINT triggered by a lock_timeout or statement_timeout, > although it's a bit weried that there are no timeout messages present in the logs. > If my assumption is correct, the behavior is understandable: the parallel apply > worker waits for the leader to send more data for the streamed transaction by > acquiring and waiting on a lock. However, the leader might be occupied with > other transactions, preventing it from sending additional data, which could > potentially lead to a lock timeout. > > To confirm this, could you please provide the values you have set for > lock_timeout, statement_timeout (on subscriber), and > logical_decoding_work_mem (on publisher) ? > > Additionally, for testing purposes, is it possible to disable these timeouts (by > setting the lock_timeout and statement_timeout GUCs to their default values) > in your testing environment to assess whether the lag still persists? This > approach can help us determine whether the timeouts are causing the lag. > > [1]--- > Aug 13 03:22:36.259 [2242351]: [952-1] LOG: logical replication parallel apply > worker for subscription "sub1" has finished > --- By the way, if possible, could you share a script that can reproduce the same log? It will be very helpful for us to debug it. Best Regards, Hou zj
В списке pgsql-bugs по дате отправления: