Re: Add Information during standby recovery conflicts
От | Drouvot, Bertrand |
---|---|
Тема | Re: Add Information during standby recovery conflicts |
Дата | |
Msg-id | 85c21e8e-4b12-9ff6-92bf-24fac2166632@amazon.com обсуждение исходный текст |
Ответ на | Re: Add Information during standby recovery conflicts (Masahiko Sawada <masahiko.sawada@2ndquadrant.com>) |
Ответы |
Re: Add Information during standby recovery conflicts
(Masahiko Sawada <masahiko.sawada@2ndquadrant.com>)
|
Список | pgsql-hackers |
On 7/7/20 9:43 AM, Masahiko Sawada wrote: > Fully makes sense, the new patch version attached is now producing: >> 2020-07-06 06:10:36.022 UTC [14035] LOG: waiting for recovery conflict >> on snapshot > How about adding the subject? that is, "recovery is waiting for > recovery conflict on %s" or "recovery process <pid> is waiting for > conflict on %s". The subject is now added in the new attached patch (I did not include the pid as it is part of the log prefix). It now looks like: 2020-07-11 12:00:41.092 UTC [23217] LOG: recovery is waiting for recovery conflict on snapshot 2020-07-11 12:00:41.092 UTC [23217] DETAIL: There is 1 blocking connection(s). 2020-07-11 12:00:41.092 UTC [23217] CONTEXT: WAL redo at 0/4A0A6BF0 for Heap2/CLEAN: remxid 1128 WAL record received at 2020-07-11 12:00:41.092231+00 tbs 1663 db 13586 rel 16805, fork main, blkno 0 > >> 2020-07-06 06:10:36.022 UTC [14035] DETAIL: WAL record received at >> 2020-07-06 06:10:36.021963+00. >> Tablespace/database/relation are 1663/13586/16672, fork is main >> and block is 0. >> There is 1 blocking connection(s). > To follow the existing log message, perhaps this can be something like > "WAL record received at %s, rel %u/%u/%u, fork %s, blkno %u. %d > processes". But I'm not sure the errdetail is the best place to > display the WAL information as I mentioned in the latter part of this > email. moved to the context and formatted the same way as the current Standby/LOCK context. > Ok. Let's keep this thread for the new attached patch that focus on the >> recovery process waiting. > Thank you for updating the patch! > > I've tested the latest patch. Thank you for testing and reviewing! > On recovery conflict on lock and on > bufferpin, if max_standby_streaming_delay is disabled (set to -1), the > logs don't appear even if log_recovery_conflicts is true. Nice catch! it is fixed in the new attached patch (the log reporting has been moved out of StandbyTimeoutHandler()). > > Here is random comments on the code: > > + recovery_conflict_main_message = psprintf("waiting for > recovery conflict on %s", > + > get_procsignal_reason_desc(reason)); > : > + ereport(LOG, > + (errmsg("%s", recovery_conflict_main_message), > + errdetail("%s\n" "There is %d blocking > connection(s).", wal_record_detail_str, num_waitlist_entries))); > > It's not translation-support-friendly. I think the message "waiting > for recovery conflict on %s" should be surrounded by _(). Or we can > just put it to ereport as follows: > > ereport(LOG, > (errmsg("waiting for recovery conflicts on %s", > get_procsignal_reason_desc(reason)) > ... changed in the new attached patch. > > --- > + oldcontext = MemoryContextSwitchTo(ErrorContext); > + econtext = error_context_stack; > + > + if (XLogRecGetBlockTag(econtext->arg, 0, &rnode, &forknum, &blknum)) > > I don't think it's a good idea to rely on error_context_stack because > other codes might set another error context before reaching here in > the future. right, changed in the new attached patch: this is now done in rm_redo_error_callback() and using the XLogReaderState passed as argument. > > --- > + if (XLogRecGetBlockTag(econtext->arg, 0, &rnode, &forknum, &blknum)) > + wal_record_detail_str = psprintf("WAL record received > at %s.\nTablespace/database/relation are %u/%u/%u, fork is %s and > block is %u.", > + receipt_time_str, > rnode.spcNode, rnode.dbNode, rnode.relNode, > + forkNames[forknum], > + blknum); > > There might be a block tag in block ids other than 0. right, fixed in the new attached patch. > I'm not sure the > errdetail is the best place where we display WAL information. moved to context in the new attached patch. > For > instance, we display both the relation oid and block number depending > on RM as follows: > > 2020-07-07 15:50:30.733 JST [13344] LOG: waiting for recovery conflict on lock > 2020-07-07 15:50:30.733 JST [13344] DETAIL: WAL record received at > 2020-07-07 15:50:27.73378+09. > There is 1 blocking connection(s). > 2020-07-07 15:50:30.733 JST [13344] CONTEXT: WAL redo at 0/3000028 > for Standby/LOCK: xid 506 db 13586 rel 16384 > > I wonder if we can display the details of redo WAL information by > improving xlog_outdesc() or rm_redo_error_callback() so that it > displays relfilenode, forknum, and block number. What do you think? I think that fully make sense to move this to rm_redo_error_callback(). This is where the information is now been displayed in the new attached patch. > > --- > + /* display wal record information */ > + if (log_recovery_conflicts) > + LogBlockedWalRecordInfo(num_waitlist_entries, reason); > > I'm concerned that we should log the recovery conflict information > every time when a recovery conflict happens from the perspective of > overheads and the amount of the logs. Can we logs that information > after waiting for deadlock_timeouts secs like log_lock_waits or > waiting for the fixed duration? The new attached patch is now waiting for deadlock_timeout duration. > > --- > @@ -609,6 +682,10 @@ StandbyTimeoutHandler(void) > /* forget any pending STANDBY_DEADLOCK_TIMEOUT request */ > disable_timeout(STANDBY_DEADLOCK_TIMEOUT, false); > > + /* display wal record information */ > + if (log_recovery_conflicts) > + LogBlockedWalRecordInfo(-1, > PROCSIG_RECOVERY_CONFLICT_BUFFERPIN); > + > SendRecoveryConflictWithBufferPin(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN); > } > > ResolveRecoveryConflictWithBufferPin() which sets a timer to call > StandbyTimeoutHandler() can be called multiple times even if the > recovery is waiting for one buffer pin. I think we should avoid > logging the same contents multiple times. I do agree, only the first pass is now been logged. > > --- > - > + { > + {"log_recovery_conflicts", PGC_SUSET, LOGGING_WHAT, > + gettext_noop("Logs standby recovery conflicts."), > + NULL > + }, > + &log_recovery_conflicts, > + true, > + NULL, NULL, NULL > + }, > > Other logging parameters such as log_lock_waits is false by default. I > think this parameter can also be false by default but is there any > reason to enable it by default? now set to false by default. > > Since this parameter applies only to the startup process, I think it > should be PGC_SIGHUP. changed that way. > > --- > + /* display wal record information */ > + if (log_recovery_conflicts) > + LogBlockedWalRecordInfo(CountDBBackends(dbid), > PROCSIG_RECOVERY_CONFLICT_DATABASE); > > We log the recovery conflict into the server log but we don't update > the process title to append "waiting". While discussing the process > title update on recovery conflict, I got the review comment[1] that we > don't need to update the process title because no wait occurs when > recovery conflict with database happens. As the comment says, recovery > is canceling the existing processes on the database being removed, but > not waiting for something. Ok, I keep reporting the conflict but changed the wording for this particular case. Thanks Bertrand
Вложения
В списке pgsql-hackers по дате отправления: