Re: Lock timeouts and unusual spikes in replication lag with logical parallel transaction streaming
From | Amit Kapila |
---|---|
Subject | Re: Lock timeouts and unusual spikes in replication lag with logical parallel transaction streaming |
Date | |
Msg-id | CAA4eK1Jy5BwZmr5Sp50Q5C+jAbfRfS_e3tFyuzyyJd6CYtgncw@mail.gmail.com |
In response to | Re: Lock timeouts and unusual spikes in replication lag with logical parallel transaction streaming (Zane Duffield <duffieldzane@gmail.com>) |
List | pgsql-bugs |
On Wed, Aug 20, 2025 at 11:08 AM Zane Duffield <duffieldzane@gmail.com> wrote:
>>
>> > On Monday, August 18, 2025 4:12 PM Zane Duffield
>> > <duffieldzane@gmail.com> wrote:
>> > > On Mon, Aug 11, 2025 at 9:28 PM Zhijie Hou (Fujitsu)
>> > > <mailto:houzj.fnst@fujitsu.com> wrote:
>
>
> Yes, I think it is the cause of the lag (every peak lines up directly with a restart of the apply workers), but I'm not sure how it relates to the complete stall shown in confirmed_flush_lsn_lag_graph_2025_08_09.png (attached again).
>
>>
>> > This might be due to a SIGINT triggered by a lock_timeout or statement_timeout,
>> > although it's a bit weird that there are no timeout messages present in the logs.
>> > If my assumption is correct, the behavior is understandable: the parallel apply
>> > worker waits for the leader to send more data for the streamed transaction by
>> > acquiring and waiting on a lock. However, the leader might be occupied with
>> > other transactions, preventing it from sending additional data, which could
>> > potentially lead to a lock timeout.
>> >
>> > To confirm this, could you please provide the values you have set for
>> > lock_timeout, statement_timeout (on subscriber), and
>> > logical_decoding_work_mem (on publisher) ?
>
>
> lock_timeout = 30s
> statement_timeout = 4h
> logical_decoding_work_mem = 64MB
>
>>
>> >
>> > Additionally, for testing purposes, is it possible to disable these timeouts (by
>> > setting the lock_timeout and statement_timeout GUCs to their default values)
>> > in your testing environment to assess whether the lag still persists? This
>> > approach can help us determine whether the timeouts are causing the lag.
>
>
> This was a good question. See the attached confirmed_flush_lsn_lag_graph_2025_08_19.png.
> After setting lock_timeout to zero, the periodic peaks of lag were eliminated, and the restarts of the apply workers in the log are also eliminated.
>

So, this was the reason.
As explained by Hou-San in his previous response, such a lock_timeout can cause the parallel apply worker to exit while waiting for more data from the leader. I think you need to either set lock_timeout to 0 or set it to a higher value, similar to what you set for statement_timeout.

>
> One other thing I wonder is whether autovacuum on the subscriber has anything to do with the lock timeouts. I'm not sure whether this could explain the perpetually-restarting apply workers that we witnessed on 2025-08-09, though.
>

No, as per my understanding it is because the parallel apply worker is exiting due to the lock_timeout set in the test. Ideally, the patch proposed by Kuroda-San should show in the logs that the parallel worker is exiting due to lock_timeout. Can you try that once?

-- 
With Regards,
Amit Kapila.
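
For reference, the two options suggested in this message (lock_timeout of 0, or a value closer to statement_timeout) might look roughly like this in the subscriber's postgresql.conf. This is only a sketch: it assumes the timeout is set server-wide in postgresql.conf, which the thread does not state, and the 4h figure is simply the statement_timeout reported above, not a recommendation.

```
# postgresql.conf on the subscriber (illustrative sketch, not taken from the thread)

# Option 1: disable the lock wait timeout (0 is the default and means "no timeout")
lock_timeout = 0

# Option 2 (alternative): keep a timeout, but raise it toward the statement_timeout in use
#lock_timeout = 4h
```

A configuration reload (for example `SELECT pg_reload_conf();`) should be enough for a change like this to take effect. Note that a server-wide setting affects every session on the subscriber, not only the apply workers.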