On 2021-Sep-25, Alvaro Herrera wrote:
>> On 2021-Sep-24, Alvaro Herrera wrote:
>>
>> > Here's the set for all branches, which I think are really final, in
>> > case somebody wants to play and reproduce their respective problem
>> > scenarios.
>>
>> I forgot to mention that I'll wait until 14.0 is tagged before getting anything
>> pushed.
Hi Alvaro, sorry for being late to the party, but to add some reassurance that the v2 committed fix really solves
the initial production problem, I've done limited testing on it (just like with the v1 patch idea earlier, using
wal_keep_segments, wal_init_zero=on, archive_mode=on and archive_command='/bin/true'):
- On 12.8, I was able, like last time, to manually reproduce it on 3 out of 3 tries, and I got: 2x "invalid contrecord
length", 1x "there is no contrecord flag" on the standby.
- On soon-to-become-12.9 REL_12_STABLE (with commit 1df0a914d58f2bdb03c11dfcd2cb9cd01c286d59), on 4 out of 4 tries
I got beautiful insight into what happened:
LOG: started streaming WAL from primary at 1/EC000000 on timeline 1
LOG: sucessfully skipped missing contrecord at 1/EBFFFFF8, overwritten at 2021-10-13 11:22:37.48305+00
CONTEXT: WAL redo at 1/EC000028 for XLOG/OVERWRITE_CONTRECORD: lsn 1/EBFFFFF8; time 2021-10-13 11:22:37.48305+00
...and the standby was able to carry on automatically. In the 4th test, cascading was tested too (m -> s1 -> s11), and
both s1 and s11 behaved properly and logged the above message. An additional check also proved that after simulating
an ENOSPC crash on the master, the data contents were identical everywhere (m1=s1=s11).
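
For anyone reading the LSNs in the log above, here is a rough sketch (not part of the test itself) of how the two
positions relate. It assumes the default 16 MB WAL segment size, which is not stated above; parse_lsn is just an
illustrative helper name.

```python
def parse_lsn(text):
    """Convert an LSN such as '1/EBFFFFF8' into a 64-bit byte position."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


# Assumption: default WAL segment size of 16 MB (it can be changed at initdb time).
WAL_SEGMENT_SIZE = 16 * 1024 * 1024

skipped = parse_lsn("1/EBFFFFF8")       # LSN of the skipped contrecord, from the log
stream_start = parse_lsn("1/EC000000")  # LSN where streaming started, from the log

# Distance from the skipped record to the position where streaming started.
bytes_before_boundary = stream_start - skipped

# True if streaming started exactly at the beginning of a WAL segment.
starts_on_boundary = stream_start % WAL_SEGMENT_SIZE == 0

print(bytes_before_boundary, starts_on_boundary)
```

Under that assumption, the skipped record begins just before a segment boundary and streaming starts exactly on it,
which fits the record having been continued into the next segment.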
Thank you Alvaro, and also everybody else who participated in solving this challenging and really nasty edge-case
bug.
-J.