Re: Add Information during standby recovery conflicts

Поиск
Список
Период
Сортировка
От Drouvot, Bertrand
Тема Re: Add Information during standby recovery conflicts
Дата
Msg-id 85c21e8e-4b12-9ff6-92bf-24fac2166632@amazon.com
обсуждение исходный текст
Ответ на Re: Add Information during standby recovery conflicts  (Masahiko Sawada <masahiko.sawada@2ndquadrant.com>)
Ответы Re: Add Information during standby recovery conflicts  (Masahiko Sawada <masahiko.sawada@2ndquadrant.com>)
Список pgsql-hackers
On 7/7/20 9:43 AM, Masahiko Sawada wrote:
> Fully makes sense, the new patch version attached is now producing:
>> 2020-07-06 06:10:36.022 UTC [14035] LOG:  waiting for recovery conflict
>> on snapshot
> How about adding the subject? that is, "recovery is waiting for
> recovery conflict on %s" or "recovery process <pid> is waiting for
> conflict on %s".


The subject is now added in the new attached patch (I did not include 
the pid as it is part of the log prefix).

It now looks like:

2020-07-11 12:00:41.092 UTC [23217] LOG:  recovery is waiting for 
recovery conflict on snapshot
2020-07-11 12:00:41.092 UTC [23217] DETAIL:  There is 1 blocking 
connection(s).
2020-07-11 12:00:41.092 UTC [23217] CONTEXT:  WAL redo at 0/4A0A6BF0 for 
Heap2/CLEAN: remxid 1128
         WAL record received at 2020-07-11 12:00:41.092231+00
         tbs 1663 db 13586 rel 16805, fork main, blkno 0
>
>> 2020-07-06 06:10:36.022 UTC [14035] DETAIL:  WAL record received at
>> 2020-07-06 06:10:36.021963+00.
>>           Tablespace/database/relation are 1663/13586/16672, fork is main
>> and block is 0.
>>           There is 1 blocking connection(s).
> To follow the existing log message, perhaps this can be something like
> "WAL record received at %s, rel %u/%u/%u, fork %s, blkno %u. %d
> processes". But I'm not sure the errdetail is the best place to
> display the WAL information as I mentioned in the latter part of this
> email.

moved to the context and formatted the same way as the current 
Standby/LOCK context.


> Ok. Let's keep this thread for the new attached patch that focus on the
>> recovery process waiting.
> Thank you for updating the patch!
>
> I've tested the latest patch.


Thank you for testing and reviewing!


> On recovery conflict on lock and on
> bufferpin, if max_standby_streaming_delay is disabled (set to -1), the
> logs don't appear even if log_recovery_conflicts is true.


Nice catch! it is fixed in the new attached patch (the log reporting has 
been moved out of StandbyTimeoutHandler()).

>
> Here is random comments on the code:
>
> +       recovery_conflict_main_message = psprintf("waiting for
> recovery conflict on %s",
> +
> get_procsignal_reason_desc(reason));
> :
> +               ereport(LOG,
> +                               (errmsg("%s", recovery_conflict_main_message),
> +                               errdetail("%s\n" "There is %d blocking
> connection(s).", wal_record_detail_str, num_waitlist_entries)));
>
> It's not translation-support-friendly. I think the message "waiting
> for recovery conflict on %s" should be surrounded by _(). Or we can
> just put it to ereport as follows:
>
> ereport(LOG,
>      (errmsg("waiting for recovery conflicts on %s",
> get_procsignal_reason_desc(reason))
>      ...


changed in the new attached patch.


>
> ---
> +       oldcontext = MemoryContextSwitchTo(ErrorContext);
> +       econtext = error_context_stack;
> +
> +       if (XLogRecGetBlockTag(econtext->arg, 0, &rnode, &forknum, &blknum))
>
> I don't think it's a good idea to rely on error_context_stack because
> other codes might set another error context before reaching here in
> the future.


right, changed in the new attached patch: this is now done in 
rm_redo_error_callback() and using the XLogReaderState passed as argument.


>
> ---
> +       if (XLogRecGetBlockTag(econtext->arg, 0, &rnode, &forknum, &blknum))
> +               wal_record_detail_str = psprintf("WAL record received
> at %s.\nTablespace/database/relation are %u/%u/%u, fork is %s and
> block is %u.",
> +                                               receipt_time_str,
> rnode.spcNode, rnode.dbNode, rnode.relNode,
> +                                               forkNames[forknum],
> +                                               blknum);
>
> There might be a block tag in block ids other than 0.


right, fixed in the new attached patch.


> I'm not sure the
> errdetail is the best place where we display WAL information.


moved to context in the new attached patch.


> For
> instance, we display both the relation oid and block number depending
> on RM as follows:
>
> 2020-07-07 15:50:30.733 JST [13344] LOG:  waiting for recovery conflict on lock
> 2020-07-07 15:50:30.733 JST [13344] DETAIL:  WAL record received at
> 2020-07-07 15:50:27.73378+09.
>          There is 1 blocking connection(s).
> 2020-07-07 15:50:30.733 JST [13344] CONTEXT:  WAL redo at 0/3000028
> for Standby/LOCK: xid 506 db 13586 rel 16384
>
> I wonder if we can display the details of redo WAL information by
> improving xlog_outdesc() or rm_redo_error_callback() so that it
> displays relfilenode, forknum, and block number. What do you think?


I think that fully make sense to move this to rm_redo_error_callback().

This is where the information is now been displayed in the new attached 
patch.


>
> ---
> +       /* display wal record information */
> +       if (log_recovery_conflicts)
> +               LogBlockedWalRecordInfo(num_waitlist_entries, reason);
>
> I'm concerned that we should log the recovery conflict information
> every time when a recovery conflict happens from the perspective of
> overheads and the amount of the logs. Can we logs that information
> after waiting for deadlock_timeouts secs like log_lock_waits or
> waiting for the fixed duration?


The new attached patch is now waiting for deadlock_timeout duration.

>
> ---
> @@ -609,6 +682,10 @@ StandbyTimeoutHandler(void)
>          /* forget any pending STANDBY_DEADLOCK_TIMEOUT request */
>          disable_timeout(STANDBY_DEADLOCK_TIMEOUT, false);
>
> +       /* display wal record information */
> +       if (log_recovery_conflicts)
> +               LogBlockedWalRecordInfo(-1,
> PROCSIG_RECOVERY_CONFLICT_BUFFERPIN);
> +
>          SendRecoveryConflictWithBufferPin(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN);
>   }
>
> ResolveRecoveryConflictWithBufferPin() which sets a timer to call
> StandbyTimeoutHandler() can be called multiple times even if the
> recovery is waiting for one buffer pin. I think we should avoid
> logging the same contents multiple times.


I do agree, only the first pass is now been logged.


>
> ---
> -
> +       {
> +               {"log_recovery_conflicts", PGC_SUSET, LOGGING_WHAT,
> +                       gettext_noop("Logs standby recovery conflicts."),
> +                       NULL
> +               },
> +               &log_recovery_conflicts,
> +               true,
> +               NULL, NULL, NULL
> +       },
>
> Other logging parameters such as log_lock_waits is false by default. I
> think this parameter can also be false by default but is there any
> reason to enable it by default?


now set to false by default.


>
> Since this parameter applies only to the startup process, I think it
> should be PGC_SIGHUP.


changed that way.


>
> ---
> +       /* display wal record information */
> +       if (log_recovery_conflicts)
> +               LogBlockedWalRecordInfo(CountDBBackends(dbid),
> PROCSIG_RECOVERY_CONFLICT_DATABASE);
>
> We log the recovery conflict into the server log but we don't update
> the process title to append "waiting". While discussing the process
> title update on recovery conflict, I got the review comment[1] that we
> don't need to update the process title because no wait occurs when
> recovery conflict with database happens. As the comment says, recovery
> is canceling the existing processes on the database being removed, but
> not waiting for something.


Ok, I keep reporting the conflict but changed the wording for this 
particular case.

Thanks

Bertrand


Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Dilip Kumar
Дата:
Сообщение: INSERT INTO SELECT, Why Parallelism is not selected?
Следующее
От: vignesh C
Дата:
Сообщение: Re: Include access method in listTables output