Patch for fail-back without fresh backup

From Samrat Revagade
Subject Patch for fail-back without fresh backup
Date
Msg-id CAF8Q-Gy7xa60HwXc0MKajjkWFEbFDWTG=gGyu1KmT+s2xcQ-bw@mail.gmail.com
Replies Re: Patch for fail-back without fresh backup  (Benedikt Grundmann <bgrundmann@janestreet.com>)
Re: Patch for fail-back without fresh backup  (Heikki Linnakangas <hlinnakangas@vmware.com>)
Re: Patch for fail-back without fresh backup  (Amit Kapila <amit.kapila@huawei.com>)
Re: Patch for fail-back without fresh backup  (Simon Riggs <simon@2ndQuadrant.com>)
List pgsql-hackers

Hello,


We have already started a discussion on pgsql-hackers about the problem of taking a fresh backup during the failback operation; here is the link for that:

 

http://www.postgresql.org/message-id/CAF8Q-Gxg3PQTf71NVECe-6OzRaew5pWhk7yQtbJgWrFu513s+Q@mail.gmail.com

 

Let me again summarize the problem we are trying to address.

 

When the master fails, the last few WAL files may not reach the standby. But the master may have gone ahead and made changes to its local file system after flushing WAL to local storage. So the master contains some file-system-level changes that the standby does not have; at this point, the master's data directory is ahead of the standby's data directory.

Subsequently, the standby will be promoted as the new master. Later, when the old master wants to become a standby of the new master, it cannot simply rejoin the setup because the two servers are inconsistent, and a fresh backup must be taken from the new master. This can happen with both synchronous and asynchronous replication.

 

A fresh backup is also needed in the case of a clean switchover because, in the current HEAD, the master does not wait for the standby to receive all WAL up to the shutdown checkpoint record before shutting down the connection. Fujii Masao has already submitted a patch to handle the clean switchover case, but the problem remains for the failback case.

 

Taking a fresh backup is very time-consuming when databases are very large, say several TB, and when the servers are connected over a relatively slow link. This can break the service-level agreement of the disaster recovery system, so the disaster recovery process in PostgreSQL needs improvement. One way to achieve this is to maintain consistency between master and standby, which avoids the need for a fresh backup.

 

So our proposal for this problem is that the master must not make any file-system-level change without confirming that the corresponding WAL record has been replicated to the standby.

 

There were many suggestions and objections on pgsql-hackers about this problem. A brief summary follows:

 

1. The main objection, raised by Tom and others, is that we should not add this feature and should stick with the traditional way of taking a fresh backup using rsync, because of the additional complexity of the patch and the performance overhead during normal operations.

 

2. Tom and others were also worried about inconsistencies in the crashed master and suggested that it is better to start with a fresh backup. Fujii Masao and others countered by noting that we trust WAL recovery to clear all such inconsistencies, and there is no reason why we cannot do the same here.

 

3. Someone suggested using rsync with checksums, but many pages on the two servers may differ in their binary values because of hint bits etc.

 

4. The major objection to the failback-without-fresh-backup idea was that it may introduce performance overhead and code complexity. Having looked at the patch, I must say it is not too complex. For the performance impact, I tested the patch with pgbench, which shows a very small overhead; please refer to the test results included at the end of this mail.

 

*Proposal to solve the problem*

 

The proposal is based on the principle that the master should not make any file-system-level change until the corresponding WAL record has been replicated to the standby.

 

There are many places in the code that need to be handled to support the proposed solution. The following cases explain why a fresh backup is currently needed at failover time, and how our approach avoids that need.

 

1. We must not write any heap page to disk before the WAL records corresponding to its changes have been received by the standby. Otherwise, if the standby fails to receive the WAL corresponding to those heap pages, there will be inconsistency.
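As a minimal model of this rule (the names and types here are illustrative, not taken from the actual patch), a dirty page may be flushed only once the WAL position that last modified it has been confirmed received by the failback-safe standby:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* stand-in for PostgreSQL's WAL position */

/* A dirty buffer whose last modification is recorded at page_lsn may be
 * written to disk only if the failback-safe standby has already received
 * WAL at least up to that point (standby_recv_lsn, as reported by
 * walsender feedback). Otherwise the flush must wait. */
static bool
page_flush_allowed(XLogRecPtr page_lsn, XLogRecPtr standby_recv_lsn)
{
    return page_lsn <= standby_recv_lsn;
}
```

The same predicate covers the control-file, truncation, and CLOG cases below: each file-system change carries an LSN that must not run ahead of what the standby has received.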

 

2. When a CHECKPOINT happens on the master, the master's control file is updated with the last checkpoint record. Suppose failover happens and the standby fails to receive the WAL record corresponding to that CHECKPOINT; then the master and standby have inconsistent copies of the control file, the redo records mismatch, and recovery will not start normally. To avoid this situation, we must not update the master's control file before the corresponding checkpoint WAL record has been received by the standby.

 

3. Likewise, when we truncate a physical file on the master and the standby fails to receive the corresponding WAL, that file is truncated on the master but still present on the standby, causing inconsistency. To avoid this, we must not truncate physical files on the master before the WAL record corresponding to that operation has been received by the standby.

 

4. The same applies to CLOG pages. If a CLOG page is written to disk and the corresponding WAL record is not replicated to the standby, that leads to inconsistency. So we must not write CLOG pages (and possibly other SLRU pages too) to disk before the corresponding WAL records have been received by the standby.

 

5. The same problem applies to commit hint bits, but it is more complicated than the other cases because no WAL records are generated for them, so we cannot apply the same method of waiting for a corresponding WAL record to be replicated. Instead, we delay updating the commit hint bits, similar to what is done for asynchronous commits. In other words, we check whether the WAL corresponding to the transaction commit has been received by the failback-safe standby, and only then allow the hint bit update.

 

 

*Patch explanation:*

 

The initial work on this patch was done by Pavan Deolasee. I tested it and will make further enhancements based on community feedback.

 

This patch is not complete yet, but I plan to finish it with the help of this community. At this point, the primary purpose is to understand the complexities and get some initial performance numbers to alleviate some of the concerns raised by the community.

 

Two GUC parameters support this failsafe standby:

 

1. failback_safe_standby_name  [ name of the failsafe standby ]  The name of the failback-safe standby. The master will not make any file-system-level change before the corresponding WAL has been replicated to this failsafe standby.

 

2. failback_safe_standby_mode  [ off/remote_write/remote_flush ]  This parameter specifies the master's behavior, i.e. whether it should wait for WAL to be written on the standby or flushed on the standby. Set it to off when no failsafe standby is wanted. This failsafe mode can be combined with both synchronous and asynchronous streaming replication.
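A minimal configuration sketch for the two GUCs described above (the standby name 'standby1' is a placeholder; the parameter names are the ones proposed by this patch, not part of stock 9.3):

```
# postgresql.conf on the master
failback_safe_standby_name = 'standby1'     # placeholder standby name
failback_safe_standby_mode = remote_flush   # off | remote_write | remote_flush
```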

 

Most of the changes are in syncrep.c. This is a slight misnomer because that file deals with synchronous standbys, and a failback-safe standby could well be (and most likely is) an asynchronous standby. But keeping the changes there has kept the patch easy to read. Once we have acceptance of the approach, the patch can be modified to reorganize the code in a more logical way.

 

The patch adds a new state, SYNC_REP_WAITING_FOR_FAILBACK_SAFETY, to the sync standby states. A backend waiting for the failback-safe standby to receive WAL records waits in this state. The failback-safe mechanism can work in two modes, waiting for WAL to be written or flushed on the failsafe standby, represented by the two new modes SYNC_REP_WAIT_FAILBACK_SAFE_WRITE and SYNC_REP_WAIT_FAILBACK_SAFE_FLUSH respectively.
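A sketch of the wait-mode additions just described. The first two modes exist in the current syncrep code; the FAILBACK_SAFE ones are the patch's additions. The ordering, the state value, and the GUC-mapping helper are illustrative assumptions, not the patch's actual definitions:

```c
#include <assert.h>
#include <string.h>

/* Wait modes: existing sync-rep modes plus the two new failback-safe ones */
typedef enum
{
    SYNC_REP_WAIT_WRITE,
    SYNC_REP_WAIT_FLUSH,
    SYNC_REP_WAIT_FAILBACK_SAFE_WRITE,
    SYNC_REP_WAIT_FAILBACK_SAFE_FLUSH,
    SYNC_REP_WAIT_MODE_COUNT
} SyncRepWaitMode;

/* New backend state: waiting for the failback-safe standby to receive WAL */
#define SYNC_REP_WAITING_FOR_FAILBACK_SAFETY 3

/* Map the failback_safe_standby_mode GUC string to a wait mode;
 * returns -1 for "off" (no failback-safe waiting). */
static int
failback_safe_wait_mode(const char *guc_value)
{
    if (strcmp(guc_value, "remote_write") == 0)
        return SYNC_REP_WAIT_FAILBACK_SAFE_WRITE;
    if (strcmp(guc_value, "remote_flush") == 0)
        return SYNC_REP_WAIT_FAILBACK_SAFE_FLUSH;
    return -1;                  /* "off" */
}
```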

 

SyncRepWaitForLSN() is also changed to support a conditional wait, so that we can delay hint bit updates on the master instead of blocking while waiting for the failback-safe standby to receive the WAL.
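The conditional-wait behavior can be modeled roughly as follows. All names here, including the feedback variable and the function itself, are assumptions for illustration; the real change lives inside SyncRepWaitForLSN() and sleeps on the process latch rather than simulating feedback:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/* Highest WAL position the failback-safe standby has confirmed
 * (advanced by walsender feedback in the real server). */
static XLogRecPtr standby_recv_lsn;

/* Conditional wait: callers that must not block (hint-bit updates) pass
 * wait=false and merely learn whether the LSN is already safe; callers
 * that can block (page flushes, control-file updates, truncations) pass
 * wait=true. */
static bool
FailbackSafeWaitForLSN(XLogRecPtr lsn, bool wait)
{
    if (lsn <= standby_recv_lsn)
        return true;            /* WAL already safe on the standby */
    if (!wait)
        return false;           /* conditional caller defers the change */

    /* Blocking caller: the real code sleeps on the backend's latch until
     * feedback advances standby_recv_lsn; here we simulate the feedback
     * eventually arriving. */
    standby_recv_lsn = lsn;
    return true;
}
```

A hint-bit update would call this with wait=false and simply skip setting the bits when it returns false, retrying on a later access, exactly as the asynchronous-commit path does today.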

 

*Benchmark tests*

 

*PostgreSQL version:* PostgreSQL 9.3beta1

 

*Usage:* To operate in failsafe mode you need to configure the following two GUC parameters:

 

1. failback_safe_standby_name

2. failback_safe_standby_mode

 

*Performance impact:*

 

The tests were performed on servers having 32 GB RAM, with checkpoint_timeout set to 10 minutes. A checkpoint flushes all dirty blocks to disk, and we primarily wanted to test that code path.

 

pgbench settings:

Transaction type: TPC-B

Scaling factor: 100

Query mode: simple

Number of clients: 100

Number of threads: 1

Duration: 1800 s

 

The following table shows the average TPS measured for each scenario. We ran 3 tests per scenario.

 

1) Synchronous Replication - 947 tps

2) Synchronous Replication + Failsafe standby (off) - 934 tps

3) Synchronous Replication + Failsafe standby (remote_flush) - 931 tps

4) Asynchronous Replication - 1369 tps

5) Asynchronous Replication + Failsafe standby (off) - 1349 tps

6) Asynchronous Replication + Failsafe standby (remote_flush) - 1350 tps

 

From the table we can conclude the following:

 

1. Streaming replication + failback safe:

a) On average, synchronous replication combined with a failsafe standby (remote_flush) causes 1.68% performance overhead.

b) On average, asynchronous streaming replication combined with a failsafe standby (remote_flush) causes 1.38% performance degradation.

 

2. Streaming replication + failback safe (turned off):

a) On average, synchronous replication combined with a failsafe standby (off) causes 1.37% performance overhead.

b) On average, asynchronous streaming replication combined with a failsafe standby (off) causes 1.46% performance degradation.

 

So the patch shows a 1-2% performance overhead.

 

Please give your suggestions if there is a need to perform tests for other scenarios.

 

*Improvements (To-do):*

1. Currently this patch supports only one failback-safe standby, which can be either a synchronous or an asynchronous standby. We probably need to discuss whether to support multiple failback-safe standbys.

2. The current design waits forever for the failback-safe standby. Streaming replication has the same limitation. We probably need to discuss whether this should be changed.

 

There are a couple more places that probably need some attention; I have marked them with XXX.

 

Thank you,

Samrat

 


Attachments
