Hi hackers,
I've attached a patch to add information during standby recovery conflicts.
The motive behind is to be able to get some information:
- On the apply side
- On the blocker(s) side
Motivation: When a standby recovery conflict occurs it could be useful to get more information to be able to dive deep on the root cause and find a way to avoid/mitigate new occurrences.
Adding this information would make the investigations easier, it could help answering questions like:
- On which LSN was the WAL apply blocked?
- What was the purpose of the bocked WAL record?
- On which relation (if any) was the blocked WAL record related to?
- What was the blocker(s) doing?
- When did the blocker(s) started their queries (if any)?
- What was the blocker(s) waiting for? on which wait event?
Technical context and proposal: There is 2 points in this patch:
- Add the information about the blocked WAL record. This is done in standby.c (ResolveRecoveryConflictWithVirtualXIDs, ResolveRecoveryConflictWithDatabase, StandbyTimeoutHandler)
- Add the information about the blocker(s). This is done in postgres.c (RecoveryConflictInterrupt)
Outcome Example:2020-06-15 06:48:23.774 UTC [6971] LOG: wal record apply is blocked by 2 connection(s), reason: User query might have needed to see row versions that must be removed.
2020-06-15 06:48:23.774 UTC [6971] LOG: blocked wal record rmgr: Heap2, lsn: 0/038E2678, received at: 2020-06-15 06:48:23.774098+00, desc: CLEAN, relation: rel 1663/13586/16652 fork main blk 0
2020-06-15 06:48:54.773 UTC [7088] LOG: about to interrupt pid: 7088, backend_type: client backend, state: active, wait_event_type: Timeout, wait_event: PgSleep, query_start: 2020-06-15 06:48:14.87672+00
2020-06-15 06:48:54.773 UTC [7088] DETAIL: statement: select *, pg_sleep(120) from bdt;
2020-06-15 06:48:54.773 UTC [7088] STATEMENT: select *, pg_sleep(120) from bdt;
2020-06-15 06:48:54.773 UTC [7088] ERROR: canceling statement due to conflict with recovery
2020-06-15 06:48:54.773 UTC [7088] DETAIL: User query might have needed to see row versions that must be removed.
2020-06-15 06:48:54.773 UTC [7088] STATEMENT: select *, pg_sleep(120) from bdt;
2020-06-15 06:48:54.778 UTC [7037] LOG: about to interrupt pid: 7037, backend_type: client backend, state: active, wait_event_type: Timeout, wait_event: PgSleep, query_start: 2020-06-15 06:48:13.008427+00
2020-06-15 06:48:54.778 UTC [7037] DETAIL: statement: select *, pg_sleep(300) from bdt;
2020-06-15 06:48:54.778 UTC [7037] STATEMENT: select *, pg_sleep(300) from bdt;
2020-06-15 06:48:54.778 UTC [7037] ERROR: canceling statement due to conflict with recovery
2020-06-15 06:48:54.778 UTC [7037] DETAIL: User query might have needed to see row versions that must be removed.
2020-06-15 06:48:54.778 UTC [7037] STATEMENT: select *, pg_sleep(300) from bdt;
It takes care of the other conflicts reason too.
So, should a standby recovery conflict occurs, we could see:
- information about the blocked WAL record apply
- information about the blocker(s)
I will add this patch to the next commitfest. I look forward to your feedback about the idea and/or implementation.
Regards,
Bertrand