Re: Some problems of recovery conflict wait events
От | Masahiko Sawada |
---|---|
Тема | Re: Some problems of recovery conflict wait events |
Дата | |
Msg-id | CA+fd4k4JMU+SdF3Y8=b6-f7QqnUYwkWHuGDJ0018H75BqPE-aw@mail.gmail.com обсуждение исходный текст |
Ответ на | Re: Some problems of recovery conflict wait events (Fujii Masao <masao.fujii@oss.nttdata.com>) |
Ответы |
Re: Some problems of recovery conflict wait events
(Fujii Masao <masao.fujii@oss.nttdata.com>)
|
Список | pgsql-hackers |
On Fri, 27 Mar 2020 at 10:32, Fujii Masao <masao.fujii@oss.nttdata.com> wrote: > > > > On 2020/03/26 14:33, Masahiko Sawada wrote: > > On Tue, 24 Mar 2020 at 17:04, Fujii Masao <masao.fujii@oss.nttdata.com> wrote: > >> > >> > >> > >> On 2020/03/05 20:16, Fujii Masao wrote: > >>> > >>> > >>> On 2020/03/05 16:58, Masahiko Sawada wrote: > >>>> On Wed, 4 Mar 2020 at 15:21, Fujii Masao <masao.fujii@oss.nttdata.com> wrote: > >>>>> > >>>>> > >>>>> > >>>>> On 2020/03/04 14:31, Masahiko Sawada wrote: > >>>>>> On Wed, 4 Mar 2020 at 13:48, Fujii Masao <masao.fujii@oss.nttdata.com> wrote: > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> On 2020/03/04 13:27, Michael Paquier wrote: > >>>>>>>> On Wed, Mar 04, 2020 at 01:13:19PM +0900, Masahiko Sawada wrote: > >>>>>>>>> Yeah, so 0001 patch sets existing wait events to recovery conflict > >>>>>>>>> resolution. For instance, it sets (PG_WAIT_LOCK | LOCKTAG_TRANSACTION) > >>>>>>>>> to the recovery conflict on a snapshot. 0003 patch improves these wait > >>>>>>>>> events by adding the new type of wait event such as > >>>>>>>>> WAIT_EVENT_RECOVERY_CONFLICT_SNAPSHOT. Therefore 0001 (and 0002) patch > >>>>>>>>> is the fix for existing versions and 0003 patch is an improvement for > >>>>>>>>> only PG13. Did you mean even 0001 patch doesn't fit for back-patching? > >>>>>>> > >>>>>>> Yes, it looks like a improvement rather than bug fix. > >>>>>>> > >>>>>> > >>>>>> Okay, understand. > >>>>>> > >>>>>>>> I got my eyes on this patch set. The full patch set is in my opinion > >>>>>>>> just a set of improvements, and not bug fixes, so I would refrain from > >>>>>>>> back-backpatching. > >>>>>>> > >>>>>>> I think that the issue (i.e., "waiting" is reported twice needlessly > >>>>>>> in PS display) that 0002 patch tries to fix is a bug. So it should be > >>>>>>> fixed even in the back branches. > >>>>>> > >>>>>> So we need only two patches: one fixes process title issue and another > >>>>>> improve wait event. I've attached updated patches. > >>>>> > >>>>> Thanks for updating the patches! I started reading 0001 patch. > >>>> > >>>> Thank you for reviewing this patch. > >>>> > >>>>> > >>>>> - /* > >>>>> - * Report via ps if we have been waiting for more than 500 msec > >>>>> - * (should that be configurable?) > >>>>> - */ > >>>>> - if (update_process_title && new_status == NULL && > >>>>> - TimestampDifferenceExceeds(waitStart, GetCurrentTimestamp(), > >>>>> - 500)) > >>>>> > >>>>> The patch changes ResolveRecoveryConflictWithSnapshot() and > >>>>> ResolveRecoveryConflictWithTablespace() so that they always add > >>>>> "waiting" into the PS display, whether wait is really necessary or not. > >>>>> But isn't it better to display "waiting" in PS basically when wait is > >>>>> necessary, like originally ResolveRecoveryConflictWithVirtualXIDs() > >>>>> does as the above? > >>>> > >>>> You're right. Will fix it. > >>>> > >>>>> > >>>>> ResolveRecoveryConflictWithDatabase(Oid dbid) > >>>>> { > >>>>> + char *new_status = NULL; > >>>>> + > >>>>> + /* Report via ps we are waiting */ > >>>>> + new_status = set_process_title_waiting(); > >>>>> > >>>>> In ResolveRecoveryConflictWithDatabase(), there seems no need to > >>>>> display "waiting" in PS because no wait occurs when recovery conflict > >>>>> with database happens. > >>>> > >>>> Isn't the startup process waiting for other backend to terminate? > >>> > >>> Yeah, you're right. I agree that "waiting" should be reported in this case. > >> > >> On second thought, in recovery conflict case, "waiting" should be reported > >> while waiting for the specified delay (e.g., by max_standby_streaming_delay) > >> until the conflict is resolved. So IMO reporting "waiting" in the case of > >> recovery conflict with buffer pin, snapshot, lock and tablespace seems valid, > >> because they are user-visible "expected" wait time. > >> > >> However, in the case of recovery conflict with database, the recovery > >> basically doesn't wait at all and just terminates the conflicting sessions > >> immediately. Then the recovery waits for all those sessions to be terminated, > >> but that wait time is basically small and should not be the user-visible. > >> If that wait time becomes very long because of unresponsive backend, ISTM > >> that LOG or WARNING should be logged instead of reporting something in > >> PS display. I'm not sure if that logging is really necessary now, though. > >> Therefore, I'm thinking that "waiting" doesn't need to be reported in the case > >> of recovery conflict with database. Thought? > > > > Fair enough. The longer wait time of conflicts with database is not > > user-expected behaviour so logging would be better. I'd like to just > > drop the change around ResolveRecoveryConflictWithDatabase() from the > > patch. > > Make sense. > > + if (update_process_title) > + waitStart = GetCurrentTimestamp(); > > Since LockBufferForCleanup() can be called very frequently, > I don't think that it's good thing to call GetCurrentTimestamp() > every time when LockBufferForCleanup() is called. > > + /* Report via ps if we have been waiting for more than 500 msec */ > + if (update_process_title && new_status == NULL && > + TimestampDifferenceExceeds(waitStart, GetCurrentTimestamp(), > + 500)) > > Do we really want to see "waiting" in PS even in non hot standby mode? > > If max_standby_streaming_delay and deadlock_timeout are set to large value, > ResolveRecoveryConflictWithBufferPin() can wait for a long time, e.g., > more than 500ms. In that case, I'm afraid that "report if we've been > waiting for more than 500ms" logic doesn't work. > > So I'm now thinking that it's better to report "waiting" immdiately before > ResolveRecoveryConflictWithBufferPin(). Of course, we can still use > "report if we've been waiting for more than 500ms" logic by teaching 500ms > to ResolveRecoveryConflictWithBufferPin() as the minimum wait time. > But this looks overkill. Thought? > > Based on the above comments, I updated the patch. Attached. Right now > the patch looks very simple. Could you review this patch? Thank you for the patch. I agree with you for all the points. Your patch looks good to me. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
В списке pgsql-hackers по дате отправления: