Re: Add Information during standby recovery conflicts

Поиск
Список
Период
Сортировка
От Drouvot, Bertrand
Тема Re: Add Information during standby recovery conflicts
Дата
Msg-id daeecc9d-567a-cad8-298a-6f12ed7cb665@amazon.com
обсуждение исходный текст
Ответ на Re: Add Information during standby recovery conflicts  (Masahiko Sawada <masahiko.sawada@2ndquadrant.com>)
Ответы Re: Add Information during standby recovery conflicts  (Masahiko Sawada <masahiko.sawada@2ndquadrant.com>)
Список pgsql-hackers
Hey,

On 6/29/20 10:55 AM, Masahiko Sawada wrote:
> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you
canconfirm the sender and know the content is safe.
 
>
>
>
> On Thu, 18 Jun 2020 at 16:28, Drouvot, Bertrand <bdrouvot@amazon.com> wrote:
>> Hi hackers,
>>
>> I've attached a patch to add information during standby recovery conflicts.
>> The motive behind is to be able to get some information:
>>
>> On the apply side
>> On the blocker(s) side
>>
>> Motivation:
>>
>> When a standby recovery conflict occurs it could be useful to get more information to be able to dive deep on the
rootcause and find a way to avoid/mitigate new occurrences.
 
> I think this is a good feature. Like log_lock_waits, it will help the
> users to investigate recovery conflict issues.
exactly, thanks for looking at it.
>
>> Adding this information would make the investigations easier, it could help answering questions like:
>>
>> On which LSN was the WAL apply blocked?
>> What was the purpose of the bocked WAL record?
>> On which relation (if any) was the blocked WAL record related to?
>> What was the blocker(s) doing?
>> When did the blocker(s) started their queries (if any)?
>> What was the blocker(s) waiting for? on which wait event?
>>
>> Technical context and proposal:
>>
>> There is 2 points in this patch:
>>
>> Add the information about the blocked WAL record. This is done in standby.c (ResolveRecoveryConflictWithVirtualXIDs,
ResolveRecoveryConflictWithDatabase,StandbyTimeoutHandler)
 
> I think we already have the information about the WAL record being
> applied in errcontext.
right, but it won’t be displayed in case log_error_verbosity is set to 
terse.
Or did you had in mind to try to avoid using the new 
“current_replayed_xlog” in xlog.c?

>
> I wonder if we can show the recovery conflict information in the main
> LOG message, the blocker information in errdetail, and use errcontext
> with regard to WAL record information. For example:
>
> LOG:  process 500 waiting for recovery conflict on snapshot
> DETAIL:  conflicting transition id: 100, 200, 300
> CONTEXT:  WAL redo at 0/3001970 for Heap2/CLEAN: remxid 506
Not sure at all if it would be possible to put all the information at 
the same time.
For example in case of shared buffer pin conflict, the blocker is 
currently known “only” during the RecoveryConflictInterrupt call (so 
still not known yet when we can “already” report the blocked LSN 
information).
It might also happen that the blocker(s) will never get interrupted (was 
not blocking anymore before max_standby_streaming_delay has been 
reached): then it would not be possible to display all the information 
here (aka when it is interrupted) while we still want to be warned that 
the WAL replay has been blocked.
>
>> Outcome Example:
>>
>> 2020-06-15 06:48:23.774 UTC [6971] LOG:  wal record apply is blocked by 2 connection(s), reason: User query might
haveneeded to see row versions that must be removed.
 
>> 2020-06-15 06:48:23.774 UTC [6971] LOG:  blocked wal record rmgr: Heap2, lsn: 0/038E2678, received at: 2020-06-15
06:48:23.774098+00,desc: CLEAN, relation: rel 1663/13586/16652 fork main blk 0
 
>> 2020-06-15 06:48:54.773 UTC [7088] LOG:  about to interrupt pid: 7088, backend_type: client backend, state: active,
wait_event_type:Timeout, wait_event: PgSleep, query_start: 2020-06-15 06:48:14.87672+00
 
> I'm concerned that the above information on the process who is about
> to be interrupted is very detailed but I'm not sure it will be helpful
> for the users. If the blocker is waiting on something lock, the
> information should be logged by log_lock_waits.

The blocker could also be in “idle in transaction”, say for example, on 
the standby (with hot_standby_feedback set to off):

standby> begin;
standby> SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
standby> select * from bdt;

on the master:

master> update bdt set generate_series = 15;
master> vacuum verbose bdt;

would produce:

2020-07-01 09:18:55.256 UTC [32751] LOG:  about to interrupt pid: 32751, 
backend_type: client backend, state: idle in transaction, 
wait_event_type: Client, wait_event: ClientRead, query_start: 2020-07-01 
09:18:17.390572+00
2020-07-01 09:18:55.256 UTC [32751] DETAIL:  statement: select * from bdt;

I think those information are useful to have (might get rid of 
wait_event_type though: done in the new attached patch).

> Also the canceled
> backend will emit the ERROR log with the message "canceling statement
> due to conflict with recovery”, and its pid can be logged at the log
> prefix. The user can know who has been canceled by recovery conflicts
> and the query.

Right, we can also get rid of the pid and the query in the new log 
message too (done in the new attached patch).

> Probably we need to consider having a time threshold (or boolean to
> turn on/off) to emit this information to the server logs. Otherwise,
> we will end up writing every conflict information, making the log
> dirty needlessly.

good point, I do agree (done in the new attached patch).

Bertrand

>
> Regards,
>
> --
> Masahiko Sawada            http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Amul Sul
Дата:
Сообщение: Re: Cleanup - adjust the code crossing 80-column window limit
Следующее
От: torikoshia
Дата:
Сообщение: Re: Creating a function for exposing memory usage of backend process