At Mon, 27 Jun 2022 15:02:11 +0900, Michael Paquier <michael@paquier.xyz> wrote in
> On Fri, Jun 24, 2022 at 04:17:34PM +0000, Imseih (AWS), Sami wrote:
> > It has been difficult to get a generic repro, but the way we
> > reproduce is through our test suite. To give more details, we are
> > running tests in which we constantly fail over and promote standbys.
> > The issue surfaces after we have gone through a few promotions, which
> > occur every few hours or so (not really important, but to give context).
>
> Hmm. Could you describe exactly the failover scenario you are using?
> Is the test using a set of cascading standbys linked to the promoted
> one? Are the standbys recycled from the promoted nodes with pg_rewind
> or created from scratch with a new base backup taken from the
> freshly-promoted primary? I have been looking more at this thread
> through the day but I don't see a remaining issue. It could be
> perfectly possible that we are missing a piece related to the handling
> of those new overwrite contrecords in some cases, like in a rewind.
>
> > I am adding some additional debugging to see if I can draw a better
> > picture of what is happening. I will also give aborted_contrec_reset_3.patch
> > a go, although I suspect it will not handle the specific case we are
> > dealing with.
>
> Yeah, this is not going to change many things if you are still seeing
> an issue. This patch does not change the logic, aka it just
True. That would be a significant hint about what happened at the time.
- Are there only two hosts in the replication set? I am concerned about
whether it is a cascading setup or not.
- Exactly what steps do you perform at every failover? In particular, do
the steps include pg_rewind, and do you copy pg_wal and/or archive
files between the failover hosts? (Something like the rsync sketch
just below.)
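A minimal sketch of the kind of manual WAL copy I have in mind; the
host name and paths are only placeholders:

    # Hypothetical: carry leftover WAL segments from the old primary
    # over to the host being promoted, before starting it.
    rsync -a old-primary:/pgdata/pg_wal/ /pgdata/pg_wal/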
> simplifies the tracking of the continuation record data, resetting it
> when a complete record has been read. Saying that, getting rid of the
> dependency on StandbyMode because we cannot promote in the middle of a
> record is nice (my memories around that were a bit blurry but even
> recovery_target_lsn would not recover in the middle of a continuation
> record), and this is not a bug, so there is limited reason to backpatch
> this part of the change.
Agreed. In the first place, my "repro" (or rather the test case) is a
bit too intricate to happen in the field.
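(By the way, whether a promoted node actually emitted the overwrite
contrecord can be checked with pg_waldump; the segment name below is
just an example:)

    # Look for an XLOG_OVERWRITE_CONTRECORD record in the first segment
    # written on the new timeline (the segment name is made up).
    pg_waldump pg_wal/000000020000000000000003 | grep OVERWRITE_CONTRECORD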
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center