Re: Synchronous Standalone Master Redoux

From: Hampus Wessman
Subject: Re: Synchronous Standalone Master Redoux
Msg-id: 4FFFCA78.6050906@hampuswessman.se
In reply to: Re: Synchronous Standalone Master Redoux  (Shaun Thomas <sthomas@optionshouse.com>)
Responses: Re: Synchronous Standalone Master Redoux  (Bruce Momjian <bruce@momjian.us>)
Re: Synchronous Standalone Master Redoux  (Jose Ildefonso Camargo Tolosa <ildefonso.camargo@gmail.com>)
List: pgsql-hackers
Hi all,

Here are some (slightly too long) thoughts about this.

Shaun Thomas wrote on 2012-07-12 22:40:
> On 07/12/2012 12:02 PM, Bruce Momjian wrote:
>
>> Well, the problem also exists if we add it as an internal database
>> feature --- how long do we wait to consider the standby dead, how do
>> we inform administrators, etc.
>
> True. Though if there is no secondary connected, either because it's not
> there yet, or because it disconnected, that's an easy check. It's the
> network lag/stall detection that's tricky.

It is indeed tricky to detect this. If you don't get an (immediate)
reply from the secondary (and you never do!), then all you can do is
wait and *eventually* (after how long? 250ms? 10s?) assume that there is
no connection between them. That conclusion may well be wrong
sometimes. A second problem is that we still don't know whether this is
caused by a network problem or by the secondary not running. It's
perfectly possible that both servers are working but just can't
communicate at the moment.
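To make the detection problem concrete, here is a minimal sketch in
Python. The names (PeerState, classify_peer) and the timeout value are
made up for illustration and are not anything PostgreSQL provides. It
only shows that a timeout gives the primary one observation, "no reply
for too long", whatever the real cause is.

```python
from enum import Enum


class PeerState(Enum):
    ALIVE = "alive"
    SUSPECTED = "suspected"  # silent for too long; the cause is unknown


def classify_peer(last_reply_at: float, now: float, timeout_s: float) -> PeerState:
    """Classify the secondary using only what the primary can observe."""
    if now - last_reply_at > timeout_s:
        return PeerState.SUSPECTED
    return PeerState.ALIVE


# Two different realities, one identical observation on the primary:
# the last reply arrived at t=0 and it is now t=10 (seconds).
crashed_secondary = classify_peer(last_reply_at=0.0, now=10.0, timeout_s=0.25)
network_partition = classify_peer(last_reply_at=0.0, now=10.0, timeout_s=0.25)
```

Both calls take the same inputs and so return the same state, which is
the point: the timeout alone cannot tell a dead secondary from a
network problem.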

The thing is that what we do next (at least if our data is important,
and why else use synchronous replication of any kind...) depends on
what *did* happen. Assume that we have two database servers. At any time
at most one primary database may be running. Without that requirement
our data can get messed up completely... If HA is important to us, we
may choose to fail over to the secondary (and live without replication
for the moment) if the primary fails. With synchronous replication, we
can do this without losing any data. If the secondary also dies, then
we do lose data (and we'll know it!), but it might be an acceptable
risk. If the secondary isn't permanently damaged, then we might even be
able to get the data back after some downtime. OK, so that's one way to
reconfigure the database servers on a failure.

If the secondary fails instead, then we can do similarly and remove it
from the "cluster" (in other words, disable synchronous replication to
the secondary). Again, we don't lose any data by doing this. We're
taking a certain risk, however. We can't safely fail over to the
secondary anymore... So if the primary fails now, then the only way not
to lose data is to hope that we can get it back from the failed machine
(the failure may be temporary).

There's also the third possibility, of course, that the two servers are 
both up and running, but they can't communicate over the network at the 
moment (this is, by the way, a difference from RAID, I guess). What do 
we do then? Well, we still need at most one primary database server.
We'll have to decide (somehow; how doesn't matter as much) which
database to keep and consider the other one "down". Then we can just do
as above (with all the same implications!). Is it always a good idea to
keep the primary? No! What if you (as a silly example) pull the network
cable from the primary (or maybe turn off a switch so that it's isolated
from most of the network)? In that case you probably want the secondary
to take over instead, at least if you value service availability. At
this point we can still do a safe failover too.

My point here is that if HA is important to you, then you may very well
want to disable synchronous replication on a failure to avoid downtime,
but this has to be integrated with your overall failover / cluster
management solution. Just having the primary automatically disable
synchronous replication doesn't seem overly useful to me... If you're
using synchronous replication to begin with, you probably want to *know*
whether you may have lost data or not. Otherwise, you will have to
assume that you did, and then you could frankly have been running async
replication all along. If you do integrate it with your failover
solution, however, then you can keep track of when it's safe to do a
failover and when it's not, and decide how to handle each case.
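As a rough sketch of what "keeping track" could look like, assume a
hypothetical cluster manager that records the state of a two-node setup.
Everything here (ClusterView, secondary_lost, primary_lost) is invented
for illustration; it is not an existing PostgreSQL or Pacemaker
interface, and it leaves out the case of the secondary rejoining.

```python
from dataclasses import dataclass


@dataclass
class ClusterView:
    """What a (hypothetical) cluster manager records about a two-node setup."""
    primary_up: bool = True
    secondary_up: bool = True
    sync_enabled: bool = True  # commits wait for the secondary


def secondary_lost(view: ClusterView) -> None:
    """Drop the secondary from the "cluster" so the primary keeps accepting commits."""
    view.secondary_up = False
    view.sync_enabled = False


def primary_lost(view: ClusterView) -> str:
    """Decide what to do when the primary fails, based on the recorded state."""
    view.primary_up = False
    if view.secondary_up and view.sync_enabled:
        return "failover: no committed data lost"
    return "no safe failover: recover the primary or accept data loss"
```

The decision in primary_lost only works because the manager itself
recorded that synchronous replication was switched off. If the primary
had switched it off on its own after a timeout, that flag would live
only on the failed machine.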

How you decide what to do with the servers on failures isn't that
important here, really. You can probably run e.g. Pacemaker on 3+
machines and have it check for quorum to accomplish this. That seems
like a good approach, at least. You can still have only 2 database
servers (for cost reasons), if you want. PostgreSQL could have all this
built-in, but I don't think it sounds overly useful to only be able to
disable synchronous replication on the primary after a timeout. Then
you can never safely do a failover to the secondary, because you can't
be sure synchronous replication was active on the failed primary...
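For illustration, the quorum idea can be sketched as a strict-majority
check. This is a simplified model with made-up node names (db1, db2,
witness), not how Pacemaker actually implements it.

```python
from typing import Set


def has_quorum(reachable: Set[str], voters: Set[str]) -> bool:
    """True if this side of a partition sees a strict majority of the voters."""
    return 2 * len(reachable & voters) > len(voters)


VOTERS = {"db1", "db2", "witness"}  # two databases plus one vote-only machine

# db1 (the primary) has its network cable pulled and sees only itself.
isolated_primary = has_quorum({"db1"}, VOTERS)
# db2 can still reach the witness, so its side holds 2 of 3 votes.
secondary_side = has_quorum({"db2", "witness"}, VOTERS)
# With only two voters, neither side of a split has a majority.
two_node_split = has_quorum({"db1"}, {"db1", "db2"})
```

In this model the isolated primary loses quorum and the secondary's
side keeps it, which matches the pulled-cable example above. With only
two voters neither side can decide, which is why a third machine is
needed even if it runs no database.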

Regards,
Hampus

