Tom Lane wrote:
> Greg Smith <greg@2ndquadrant.com> writes:
> > I don't see this as needing any implementation any more complicated than
> > the usual way such timeouts are handled. Note how long you've been
> > trying to reach the standby. Default to -1 for forever. And if you hit
> > the timeout, mark the standby as degraded and force them to do a proper
> > resync when they disconnect. Once that's done, then they can re-enter
> > sync rep mode again, via the same process a new node would have done so.
>
> Well, actually, that's *considerably* more complicated than just a
> timeout. How are you going to "mark the standby as degraded"? The
> standby can't keep that information, because it's not even connected
> when the master makes the decision. ISTM that this requires
>
> 1. a unique identifier for each standby (not just role names that
> multiple standbys might share);
>
> 2. state on the master associated with each possible standby -- not just
> the ones currently connected.
>
> Both of those are perhaps possible, but the sense I have of the
> discussion is that people want to avoid them.
>
> Actually, #2 seems rather difficult even if you want it. Presumably
> you'd like to keep that state in reliable storage, so it survives master
> crashes. But how you gonna commit a change to that state, if you just
> lost every standby (suppose master's ethernet cable got unplugged)?
> Looks to me like it has to be reliable non-replicated storage. Leaving
> aside the question of how reliable it can really be if not replicated,
> it's still the case that we have noplace to put such information given
> the WAL-is-across-the-whole-cluster design.
I assumed we would have a parameter called "sync_rep_failure" that would
take a command and the command would be called when communication to the
slave was lost. If you restart, it tries again and might call the
function again.
-- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB
http://enterprisedb.com
+ It's impossible for everything to be true. +