Thread: D.R. Site Failover (Streaming Replication) - user access / network options

D.R. Site Failover (Streaming Replication) - user access / network options

From
CS DBA
Date:
Hi All;

We are assisting a client with an Oracle to PostgreSQL conversion. They
did not have any replication with Oracle due to the version they ran,
but in moving to PostgreSQL they want to set up a local hot standby and
a remote / DR site standby. Setting up the standbys is easy enough;
we've set up standbys many times and automated failover many times as
well. In the past we have used a number of approaches to keep a
failover seamless to the app. In most DR / remote site cases we've
recommended that the full stack be replicated, so that if we completely
lose a site (or cloud region) we switch to the secondary site, promote
the secondary / DR site standby to master, and we're back in business.

I do however have a few questions related to this, and I'm interested
to find out what others have done here. In particular, how do you go
about moving end users (assuming a web app is the end user entry point)
to point seamlessly to the secondary site? Also, how have you all dealt
with the possible split brain issue (i.e. we fail over, then the primary
site comes back up and existing/old connections to the old site then
write to the old master)?

Thanks in advance for your feedback....



Re: D.R. Site Failover (Streaming Replication) - user access / network options

From
Fernando Hevia
Date:


On Tue, Mar 8, 2016 at 1:48 PM, CS DBA <cs_dba@consistentstate.com> wrote:

I do however have a few questions related to this, I'm interested to find out what others have done here, in particular how do you go about moving end users (assuming a web app is the end user entry point) to point seamlessly to the secondary site?  Also how have you all dealt with the possible split brain issue (i.e. we fail over, then the primary site comes back up and existing/old connections to the old site then write to the old master)

While not seamless, you can achieve a reasonably fast failover by using DNS servers with a short TTL (under 2 min). On failure, have your monitoring tool fire the failover scripts (promote the postgres server, enable the app server, etc.) and then change the app's DNS record to the secondary site's IP address. Within a very short time you should have your users working on the secondary site.

Cloudflare or Amazon's Route 53 can provide the DNS capability. It is simple, reliable and cheap.
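
To make the sequence above concrete, here is a minimal sketch of the monitoring side, not a production tool. Everything named in it is an assumption for illustration: the `FailoverController` class, the health probe, and the three stand-in steps, which only record what they would do. In a real setup the steps would run something like `pg_ctl promote` on the standby and call your DNS provider's API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class FailoverController:
    """Runs a list of failover steps once the primary looks dead."""

    primary_is_healthy: Callable[[], bool]  # hypothetical health probe
    steps: List[Callable[[], None]]         # run in order: promote, app, DNS
    failures_needed: int = 3                # consecutive failed probes before acting
    consecutive_failures: int = 0
    failed_over: bool = False

    def tick(self) -> bool:
        """Probe once; return True if this call triggered the failover."""
        if self.failed_over:
            return False  # never fail over twice; switching back is manual
        if self.primary_is_healthy():
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        if self.consecutive_failures < self.failures_needed:
            return False
        for step in self.steps:
            step()
        self.failed_over = True
        return True


# Demo with stand-in steps that only record what they would do.
log: List[str] = []
probe_results = iter([True, False, False, False])
controller = FailoverController(
    primary_is_healthy=lambda: next(probe_results),
    steps=[
        lambda: log.append("promote standby"),    # e.g. run `pg_ctl promote`
        lambda: log.append("enable app server"),
        lambda: log.append("update DNS record"),  # e.g. call the DNS provider's API
    ],
)
triggered = [controller.tick() for _ in range(4)]
print(triggered)
print(log)
```

Requiring several consecutive failed probes before acting is a design choice to avoid failing over on a single network blip. The cost is a longer detection time, which adds to the DNS TTL before users actually land on the secondary site.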

Once the primary site is back, split brain shouldn't be a problem, since your DNS will keep directing traffic to your secondary site until you intervene to switch back.
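
If you want the monitoring to double-check this, one option (my assumption, not something required by the DNS approach) is to flag the case where more than one node accepts writes. The sketch below takes the result of `SELECT pg_is_in_recovery()` per node, which is true on a standby and false on a primary. The node names and the two helper functions are hypothetical, and collecting the values from each server is left out.

```python
from typing import Dict, List


def writable_nodes(in_recovery: Dict[str, bool]) -> List[str]:
    """Return the nodes that accept writes (pg_is_in_recovery() was false)."""
    return sorted(name for name, recovering in in_recovery.items() if not recovering)


def is_split_brain(in_recovery: Dict[str, bool]) -> bool:
    """More than one writable node means two masters are live at once."""
    return len(writable_nodes(in_recovery)) > 1


# Normal operation: one primary, two standbys (hypothetical node names).
normal = {"site-a-db": False, "site-a-standby": True, "site-b-db": True}

# After a failover, the old primary site comes back up still acting as master.
after_return = {"site-a-db": False, "site-a-standby": True, "site-b-db": False}

print(is_split_brain(normal))
print(writable_nodes(after_return))
```

When the check fires, the old master should be kept away from clients (stopped or fenced) until it has been rebuilt as a standby of the new master.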

Or... you can go with BGP and let the network team do the dirty work at the routing level. With BGP you should also expect somewhere between 10 and 120 seconds of downtime until the route changes propagate.


Cheers,
Fernando.