Re: Synchronization levels in SR

From	Fujii Masao
Subject	Re: Synchronization levels in SR
Date
Msg-id AANLkTinZdITDHp-F_xLTYA-ZL8XV_hRcKt5IVICNTSn4@mail.gmail.com
In reply to	Re: Synchronization levels in SR  (Simon Riggs <simon@2ndQuadrant.com>)
Responses	Re: Synchronization levels in SR  (Simon Riggs <simon@2ndQuadrant.com>)
	Re: Synchronization levels in SR  (Simon Riggs <simon@2ndQuadrant.com>)
	Re: Synchronization levels in SR  (Robert Haas <robertmhaas@gmail.com>)
List	pgsql-hackers
On Wed, May 26, 2010 at 10:37 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> If the remote server responded first, then that proves it is a better
> candidate for failover than the one you think of as near. If the two
> standbys vary over time then you have network problems that will
> directly affect the performance on the master; synch_rep = N would
> respond better to any such problems.

No. The remote standby might temporarily respond first even though it is
almost always behind the near one. Read-only queries or an incrementally
updated backup operation might cause bursty disk writes on the near standby
and delay its ACK. Lock contention between read-only queries and recovery
can also delay the ACK. So the standby that responds first is not always
the best failover candidate. Also, the administrator generally doesn't put
the remote standby under the control of clusterware like Heartbeat, in
which case the remote standby will never be a failover candidate. But
quorum commit cannot cover this simple case.
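
For illustration, the difference between the two styles in this simple
case (one near standby which is the only failover candidate, plus one
remote standby) could look like the sketch below. The per-standby
parameter name "replication_level" is just an assumption for the sketch,
not a settled interface; only synchronous_replication is a GUC name used
in this thread.

    # Per-standby style (parameter name is an assumption):
    #   near standby:   replication_level = 'sync'   # commit waits for its ACK
    #   remote standby: replication_level = 'async'  # never delays commits and
    #                                                # is never a failover candidate
    #
    # Quorum style (GUC name from this thread):
    synchronous_replication = 1   # wait for the first ACK, which may come from
                                  # the remote standby, so the near standby is
                                  # not guaranteed to have the committed transaction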

>> OTOH, "synchronous_replication=2" degrades the
>> performance on the master very much.
>
> Yes, but only because you have only one near standby. It would clearly
> to be foolish to make this setting without 2+ near standbys. We would
> then have 4 or more servers; how do we specify everything for that
> config??

If you always want the near standby to be the failover candidate while
using quorum commit in the above simple case, you would indeed need to
choose such a foolish setting. Otherwise you might unfortunately have to
fail over to the remote standby, which is not under the control of
clusterware.
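
Concretely, with only one near and one remote standby, quorum commit
leaves just these two choices (a sketch using the GUC name from this
thread):

    synchronous_replication = 1   # fast, but the single ACK may have come from
                                  # the remote standby, so the near failover
                                  # candidate may be missing committed transactions
    synchronous_replication = 2   # the near standby is guaranteed to have sent
                                  # an ACK, but every commit now also waits for
                                  # the remote (WAN) round trip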

>> "synchronous_replication" approach
>> doesn't seem to cover the typical use case.
>
> You described the failure modes for the quorum proposal, but avoided
> describing the failure modes for the "per-standby" proposal.
>
> Please explain what will happen when the near server is unavailable,
> with per-standby settings. Please also explain what will happen if we
> choose to have 4 or 5 servers to maintain performance in case of the
> near server going down. How will we specify the failure modes?

I'll try to explain that.

(1) most standard case: 1 master + 1 "sync" standby (near)

   When the master goes down, something like clusterware detects that
   failure and brings the standby online. Since we can ensure that the
   standby has all the committed transactions, failover doesn't cause
   any data loss.

   When the standby goes down or a network outage happens, walsender
   detects that failure via the replication timeout, a keepalive, or an
   error return from the system calls. Then walsender reacts according
   to the specified reaction (a GUC) to the failure of the standby;
   e.g., it wakes the transaction commit up from its wait-for-ACK and
   exits. The master then runs standalone.
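
As a minimal sketch of the master-side settings for case (1): the reaction
GUC is not named anywhere in this thread, so "standby_failure_reaction" and
its values below are hypothetical; synch_rep_timeout is the name mentioned
elsewhere in this thread.

    # master of: 1 master + 1 "sync" standby (near)
    synch_rep_timeout = 10s                    # stop waiting for the ACK after 10s
    # Hypothetical reaction GUC, for illustration only:
    # standby_failure_reaction = 'standalone'  # release the waiting commits,
    #                                          # walsender exits, master runs alone
    # standby_failure_reaction = 'halt'        # stop accepting commits instead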
 

(2) 1 master + 1 "sync" standby (near) + 1 "async" standby (remote)

   When the master goes down, something like clusterware brings the
   "sync" standby in the near location online. The administrator would
   need to take a fresh base backup of the new master, load it on the
   remote standby, change primary_conninfo, and restart the remote
   standby.

   When one of the standbys goes down, walsender does the same thing
   described in (1). Until the failed standby has been restarted, the
   master runs together with the other standby.
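
The resynchronization of the remote standby in case (2) might look roughly
like the following. Host names and paths are assumptions, and the base
backup is taken manually with pg_start_backup/pg_stop_backup:

    # on the new master (the promoted near standby): take a fresh base backup
    psql -c "SELECT pg_start_backup('resync remote standby')"
    rsync -a $PGDATA/ remote:/path/to/data/    # host and path are assumptions
    psql -c "SELECT pg_stop_backup()"

    # on the remote standby: point recovery.conf at the new master and restart
    #   standby_mode = 'on'
    #   primary_conninfo = 'host=new-master port=5432'  # host name is an assumption
    pg_ctl -D /path/to/data restart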
 

In (1) and (2), after some failure happens, there would be only one server
that is guaranteed to have all the committed transactions. When it also
goes down, the database service stops. If you want to avoid this fragile
situation, you would need to add one more "sync" standby at the near site.

(3) 1 master + 2 "sync" standbys (near) + 1 "async" standby (remote)

   When the master goes down, something like clusterware brings one of
   the "sync" standbys online by using some selection algorithm. The
   administrator would need to take a fresh base backup of the new
   master, load it on both remaining standbys, change primary_conninfo,
   and restart them.

   When one of the standbys goes down, walsender does the same thing
   described in (1). Until the failed standby has been restarted, the
   master runs together with the other two standbys. At least one
   standby is guaranteed to be in sync with the master.
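
Under the per-standby proposal, case (3) might be declared roughly as
below. The parameter name "replication_level" is again only an assumption,
and whether the master waits for all of the 'sync' standbys or for any one
of them is exactly the kind of detail still being discussed in this thread.

    # Per-standby declarations (parameter name is an assumption):
    #   near standby A:  replication_level = 'sync'
    #   near standby B:  replication_level = 'sync'
    #   remote standby:  replication_level = 'async'  # never delays commits and
    #                                                 # is not a failover candidate
    # The master's commits wait only for standbys marked 'sync', so at least
    # one near standby always has every committed transaction.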

Is this explanation enough?

>> Also, when "synchronous_replication=1" and one of synchronous standbys
>> goes down, how should the surviving standby catch up with the master?
>> Such standby might be too far behind the master. The transaction commit
>> should wait for the ACK from the lagging standby immediately even if
>> there might be large gap? If yes, "synch_rep_timeout" would screw up
>> the replication easily.
>
> That depends upon whether we send the ACK at point #2, #3 or #4. It
> would only cause a problem if you waited until #4.

Yeah, the problem happens then. If we implement quorum commit, we need to
design how the surviving standby catches up with the master.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

