Обсуждение: Outdated replication protocol error?

Поиск
Список
Период
Сортировка

Outdated replication protocol error?

От
Jeff Davis
Дата:
Commit 5ee2197767 (about 4 years old) introduced the error:

  "IDENTIFY_SYSTEM has not been run before START_REPLICATION"

But it seems like running START_REPLICATION first works (at least in
the simple case).

We should either:

1. Document that IDENTIFY_SYSTEM must always be run before
START_REPLICATION, and always issue a WARNING if that's not done (an
ERROR might break existing applications); or

2. If the commit is out of date and no longer needed, or if it's easy
enough to fix, just remove the error (and Assert a valid
ThisTimeLineID).

Regards,
    Jeff Davis






Re: Outdated replication protocol error?

От
Fujii Masao
Дата:

On 2021/01/12 9:06, Jeff Davis wrote:
> Commit 5ee2197767 (about 4 years old) introduced the error:
> 
>    "IDENTIFY_SYSTEM has not been run before START_REPLICATION"
> 
> But it seems like running START_REPLICATION first works (at least in
> the simple case).
> 
> We should either:
> 
> 1. Document that IDENTIFY_SYSTEM must always be run before
> START_REPLICATION, and always issue a WARNING if that's not done (an
> ERROR might break existing applications); or
> 
> 2. If the commit is out of date and no longer needed, or if it's easy
> enough to fix, just remove the error (and Assert a valid
> ThisTimeLineID).

+1 to remove the error if START_REPLICATION can always work fine without
IDENTIFY_SYSTEM. I found that the error happens when we connect to the standby
and just run START_REPLICATION without IDENTIFY_SYSTEM. But I'm not sure
if IDENTIFY_SYSTEM is actually necessary even in that case.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: Outdated replication protocol error?

От
Andres Freund
Дата:
Hi,

On 2021-01-14 16:40:26 +0900, Fujii Masao wrote:
> On 2021/01/12 9:06, Jeff Davis wrote:
> > Commit 5ee2197767 (about 4 years old) introduced the error:
> > 
> >    "IDENTIFY_SYSTEM has not been run before START_REPLICATION"
> > 
> > But it seems like running START_REPLICATION first works (at least in
> > the simple case).
> > 
> > We should either:
> > 
> > 1. Document that IDENTIFY_SYSTEM must always be run before
> > START_REPLICATION, and always issue a WARNING if that's not done (an
> > ERROR might break existing applications); or
> > 
> > 2. If the commit is out of date and no longer needed, or if it's easy
> > enough to fix, just remove the error (and Assert a valid
> > ThisTimeLineID).
> 
> +1 to remove the error if START_REPLICATION can always work fine without
> IDENTIFY_SYSTEM. I found that the error happens when we connect to the standby
> and just run START_REPLICATION without IDENTIFY_SYSTEM. But I'm not sure
> if IDENTIFY_SYSTEM is actually necessary even in that case.

The current approach seems quite bad to me too. By that point the value
could be just about arbitrarily out of date - but that doesn't really
matter because GetStandbyFlushRecPtr() updates it. And for the
!am_cascading_walsender it's trivial to compute.

Has anybody dug out the thread leading to the commit?

Greetings,

Andres Freund



Re: Outdated replication protocol error?

От
Jeff Davis
Дата:
On Fri, 2021-01-15 at 13:55 -0800, Andres Freund wrote:
> Has anybody dug out the thread leading to the commit?


https://www.postgresql.org/message-id/CAMsr%2BYEN04ztb%2BSDb_UfD72Kg5M3F%2BCpad31QBKT2rRjysmxRg%40mail.gmail.com

Regards,
    Jeff Davis





Re: Outdated replication protocol error?

От
Jeff Davis
Дата:
On Fri, 2021-01-15 at 13:55 -0800, Andres Freund wrote:
> > > We should either:
> > > 
> > > 1. Document that IDENTIFY_SYSTEM must always be run before
> > > START_REPLICATION, and always issue a WARNING if that's not done
> > > (an
> > > ERROR might break existing applications); or
> > > 
> > > 2. If the commit is out of date and no longer needed, or if it's
> > > easy
> > > enough to fix, just remove the error (and Assert a valid
> > > ThisTimeLineID).
> > 
> > +1 to remove the error if START_REPLICATION can always work fine
> > without
> > IDENTIFY_SYSTEM. I found that the error happens when we connect to
> > the standby
> > and just run START_REPLICATION without IDENTIFY_SYSTEM. But I'm not
> > sure
> > if IDENTIFY_SYSTEM is actually necessary even in that case.
> 
> The current approach seems quite bad to me too. By that point the
> value
> could be just about arbitrarily out of date - but that doesn't really
> matter because GetStandbyFlushRecPtr() updates it. And for the
> !am_cascading_walsender it's trivial to compute.

[ digging up old thread ]

It seems everyone agrees that the current behavior is strange. Any
ideas on a solution here?

> Has anybody dug out the thread leading to the commit?


https://www.postgresql.org/message-id/CAMsr%2BYEN04ztb%2BSDb_UfD72Kg5M3F%2BCpad31QBKT2rRjysmxRg%40mail.gmail.com

Regards,
    Jeff Davis





Re: Outdated replication protocol error?

От
Andres Freund
Дата:
Hi,

On 2021-06-16 15:59:11 -0700, Jeff Davis wrote:
> [ digging up old thread ]
> 
> It seems everyone agrees that the current behavior is strange.

Yea. I don't remember the details, but I've also hit this problem since
in some odd circumstance while reviewing the "logical decoding on
standbys" patchset.


> Any ideas on a solution here?

I think we should explicitly compute the current timeline before using
ThisTimelineID. E.g. in StartReplication() call a new version of
GetFlushRecPtr() that also returns the current timeline id.

Greetings,

Andres Freund



Re: Outdated replication protocol error?

От
Jeff Davis
Дата:
On Wed, 2021-06-16 at 16:17 -0700, Andres Freund wrote:
> I think we should explicitly compute the current timeline before
> using
> ThisTimelineID. E.g. in StartReplication() call a new version of
> GetFlushRecPtr() that also returns the current timeline id.

I think all we need to do is follow the pattern in IdentifySystem() by
calling:

    am_cascading_walsender = RecoveryInProgress();

first. There are three cases:

1. If the server was a primary the last time RecoveryInProgress() was
called, and it's still a primary, then it returns false immediately.
ThisTimeLineID should be set properly already.

2. If the server was a secondary the last time RecoveryInProgress() was
called, and now it's a primary, then it updates ThisTimeLineID in
InitXLOGAccess() and returns false.

3. If the server was a secondary the last time, and is still a
secondary, then it returns true. Then, StartReplication() will call
GetStandbyFlushRecPtr(), which will update ThisTimeLineID.

It was confusing to me for a while because I was trying to sort out
whether am_cascading_walsender and/or ThisTimeLineID could be out of
date, and how that would result in not updating ThisTimeLineID; and
also why there was a difference between IdentifySystem() and
StartReplication().

Simple patch attached. I didn't test it yet, but wanted to post my
analysis.

Regards,
    Jeff Davis


Вложения

Re: Outdated replication protocol error?

От
Andres Freund
Дата:
Hi,

On 2021-06-17 18:13:57 -0700, Jeff Davis wrote:
> On Wed, 2021-06-16 at 16:17 -0700, Andres Freund wrote:
> > I think we should explicitly compute the current timeline before
> > using
> > ThisTimelineID. E.g. in StartReplication() call a new version of
> > GetFlushRecPtr() that also returns the current timeline id.
> 
> I think all we need to do is follow the pattern in IdentifySystem() by
> calling:
> 
>     am_cascading_walsender = RecoveryInProgress();
> 
> first.

Yea, that sounds reasonable.

I'm not a fan of hiding the timeline determination inside
RecoveryInProgress(), particularly not when communicated via global
variable. But that's not the fault of this patch.

Greetings,

Andres Freund