Add Information during standby recovery conflicts

Поиск
Список
Период
Сортировка
От Drouvot, Bertrand
Тема Add Information during standby recovery conflicts
Дата
Msg-id 9a60178c-a853-1440-2cdc-c3af916cff59@amazon.com
обсуждение исходный текст
Ответы Re: Add Information during standby recovery conflicts  (Masahiko Sawada <masahiko.sawada@2ndquadrant.com>)
Список pgsql-hackers

Hi hackers,

I've attached a patch to add information during standby recovery conflicts.
The motive behind is to be able to get some information:

  • On the apply side
  • On the blocker(s) side
Motivation:

When a standby recovery conflict occurs it could be useful to get more information to be able to dive deep on the root cause and find a way to avoid/mitigate new occurrences.

Adding this information would make the investigations easier, it could help answering questions like:
  • On which LSN was the WAL apply blocked?
  • What was the purpose of the bocked WAL record?
  • On which relation (if any) was the blocked WAL record related to?
  • What was the blocker(s) doing?
  • When did the blocker(s) started their queries (if any)?
  • What was the blocker(s) waiting for? on which wait event?
Technical context and proposal:

There is 2 points in this patch:
  • Add the information about the blocked WAL record. This is done in standby.c (ResolveRecoveryConflictWithVirtualXIDs, ResolveRecoveryConflictWithDatabase, StandbyTimeoutHandler)
  • Add the information about the blocker(s). This is done in postgres.c (RecoveryConflictInterrupt)
Outcome Example:
2020-06-15 06:48:23.774 UTC [6971] LOG:  wal record apply is blocked by 2 connection(s), reason: User query might have needed to see row versions that must be removed.
2020-06-15 06:48:23.774 UTC [6971] LOG:  blocked wal record rmgr: Heap2, lsn: 0/038E2678, received at: 2020-06-15 06:48:23.774098+00, desc: CLEAN, relation: rel 1663/13586/16652 fork main blk 0
2020-06-15 06:48:54.773 UTC [7088] LOG:  about to interrupt pid: 7088, backend_type: client backend, state: active, wait_event_type: Timeout, wait_event: PgSleep, query_start: 2020-06-15 06:48:14.87672+00
2020-06-15 06:48:54.773 UTC [7088] DETAIL:  statement: select *, pg_sleep(120)  from bdt;
2020-06-15 06:48:54.773 UTC [7088] STATEMENT:  select *, pg_sleep(120)  from bdt;
2020-06-15 06:48:54.773 UTC [7088] ERROR:  canceling statement due to conflict with recovery
2020-06-15 06:48:54.773 UTC [7088] DETAIL:  User query might have needed to see row versions that must be removed.
2020-06-15 06:48:54.773 UTC [7088] STATEMENT:  select *, pg_sleep(120)  from bdt;
2020-06-15 06:48:54.778 UTC [7037] LOG:  about to interrupt pid: 7037, backend_type: client backend, state: active, wait_event_type: Timeout, wait_event: PgSleep, query_start: 2020-06-15 06:48:13.008427+00
2020-06-15 06:48:54.778 UTC [7037] DETAIL:  statement: select *, pg_sleep(300)  from bdt;
2020-06-15 06:48:54.778 UTC [7037] STATEMENT:  select *, pg_sleep(300)  from bdt;
2020-06-15 06:48:54.778 UTC [7037] ERROR:  canceling statement due to conflict with recovery
2020-06-15 06:48:54.778 UTC [7037] DETAIL:  User query might have needed to see row versions that must be removed.
2020-06-15 06:48:54.778 UTC [7037] STATEMENT:  select *, pg_sleep(300)  from bdt;

It takes care of the other conflicts reason too.
So, should a standby recovery conflict occurs, we could see:

  • information about the blocked WAL record apply
  • information about the blocker(s)
I will add this patch to the next commitfest. I look forward to your feedback about the idea and/or implementation.

Regards,

Bertrand

Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Peter Eisentraut
Дата:
Сообщение: Re: factorial of negative numbers
Следующее
От: Kyotaro Horiguchi
Дата:
Сообщение: Re: Review for GetWALAvailability()