Discussion: Hot Standby Failover Scenario

Hot Standby Failover Scenario

From
Lucky Haryadi
Date:
Hi everybody.

I want to ask about hot-standby issues. First, let me describe my Postgres master-slave scenario.

1. Master A and Slave B are in different locations, assume different regions of the country.
2. Master A and Slave B are configured for hot standby as described in the documentation.
3. When Master A fails, the database is failed over to Slave B by creating the trigger file.
4. As soon as Slave B becomes a standalone pg server, run pg_start_backup(), so that all transactions will only be recorded to WAL files.
5. Applications are switched to standalone B until the recovery of Server A is done.
6. When Server A has recovered (but is still offline), run pg_stop_backup() and copy all WAL files from B to A.
7. Once the WAL files are copied to A, set A's configuration back to master and B's back to slave (for B, rename recovery.done to recovery.conf and remove the trigger file).
8. Bring up A, restart B, and switch all applications back to A.

I've tried this procedure with no luck. Before A failed, A had 10 million records and B had 10 million records too. Then I failed over to B manually, simulating a failure of A. I ran pg_start_backup() and inserted a bunch of data, so that, let's say, A still had 10 million records and B had 20 million. Then I copied the WAL files from B to A, hoping that when A came up again it would catch up with B (A 20 million, B 20 million) and hot-standby streaming would resume. But the experiment failed.
I've checked the logs and found that the timeline is invalid. Slave B's log says that the timeline of the primary server (Master A) does not match the target timeline of the standby server. Can anyone suggest what to do in this case? Any suggestions will be greatly appreciated. Thank you.

Re: Hot Standby Failover Scenario

From
Greg Smith
Date:
On 02/27/2012 10:05 PM, Lucky Haryadi wrote:
> 3. When Master A fails to service, the database will failovered to Slave
> B by triggering with trigger file.

As soon as you trigger a standby, it changes it to a new timeline.  At 
that point, the series of WAL files diverges.  It's no longer possible 
to apply them to a system that is still on the original timeline, such 
as your original master A in this situation.  There's a good reason for 
that.  Let's say that A committed an additional transaction before it 
went down, but that commit wasn't replicated to B.  You can't just move 
records from B over anymore in that case.  The only way to make sure A 
is in sync again is to do a new base backup, which you can potentially 
accelerate using rsync to only copy what has changed.  I see a lot of 
people try to bypass one of the steps recommended in the manual using 
various schemes like yours, and they usually have a bug like this in 
there--sometimes obvious like this, sometimes subtle.  Trying to get too 
clever here is dangerous to your database.
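
To make the timeline divergence concrete, here is a minimal Python sketch (an illustration added for readers, not part of the procedure in the manual). It relies on the fact that a WAL segment file name is 24 hex digits and that the first 8 encode the timeline ID. The helper names wal_timeline and same_timeline are hypothetical, and the check is a simplification: real recovery also consults the timeline .history files.

```python
import string


def wal_timeline(segment_name: str) -> int:
    """Return the timeline ID encoded in a WAL segment file name.

    A segment name is 24 hex digits: 8 for the timeline ID,
    followed by 16 that identify the segment's position in the WAL.
    """
    if len(segment_name) != 24 or any(
        ch not in string.hexdigits for ch in segment_name
    ):
        raise ValueError(f"not a WAL segment name: {segment_name!r}")
    return int(segment_name[:8], 16)


def same_timeline(segment_names) -> bool:
    """True if every given segment belongs to a single timeline."""
    return len({wal_timeline(name) for name in segment_names}) <= 1


# Segment written by the original master A (timeline 1).
before_failover = "000000010000000000000003"
# Segment written by B after it was triggered (timeline 2).
after_failover = "000000020000000000000004"

print(wal_timeline(before_failover))
print(wal_timeline(after_failover))
print(same_timeline([before_failover, after_failover]))
```

Segments whose names start with a different timeline ID belong to a different history, which is why copying B's post-failover WAL files into A does not bring A back in sync.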

Warning:  pgsql-hackers is the mailing list for people to discuss the 
development of PostgreSQL, not how to use it.  Questions like this 
should be asked on either the pgsql-admin or pgsql-general mailing 
list.  I'm not going to answer additional questions like this from you 
here on this list, and I doubt anyone else will either.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com