Re: BUG #7533: Client is not able to connect cascade standby incase basebackup is taken from hot standby
From: Heikki Linnakangas
Subject: Re: BUG #7533: Client is not able to connect cascade standby incase basebackup is taken from hot standby
Msg-id: 5051CFD2.60103@iki.fi
In response to: Re: BUG #7533: Client is not able to connect cascade standby incase basebackup is taken from hot standby (Fujii Masao <masao.fujii@gmail.com>)
Responses: Re: BUG #7533: Client is not able to connect cascade standby incase basebackup is taken from hot standby (Amit Kapila <amit.kapila@huawei.com>)
           Re: BUG #7533: Client is not able to connect cascade standby incase basebackup is taken from hot standby (Fujii Masao <masao.fujii@gmail.com>)
List: pgsql-bugs
On 12.09.2012 22:03, Fujii Masao wrote:
> On Wed, Sep 12, 2012 at 8:47 PM, <amit.kapila@huawei.com> wrote:
>> The following bug has been logged on the website:
>>
>> Bug reference:      7533
>> Logged by:          Amit Kapila
>> Email address:      amit.kapila@huawei.com
>> PostgreSQL version: 9.2.0
>> Operating system:   Suse
>> Description:
>>
>> M host is primary, S host is standby and CS host is cascaded standby.
>>
>> 1. Set up postgresql-9.2beta2/RC1 on all hosts.
>> 2. Execute the command initdb on host M to create a fresh database.
>> 3. Modify the configuration file postgresql.conf on host M like this:
>>      listen_addresses = 'M'
>>      port = 15210
>>      wal_level = hot_standby
>>      max_wal_senders = 4
>>      hot_standby = on
>> 4. Modify the configuration file pg_hba.conf on host M like this:
>>      host replication repl M/24 md5
>> 5. Start the server on host M as primary.
>> 6. Connect one client to the primary server and create a user 'repl':
>>      CREATE USER repl SUPERUSER PASSWORD '123';
>> 7. Use the command pg_basebackup on host S to retrieve the database of
>>    the primary host:
>>      pg_basebackup -D /opt/t38917/data -F p -x fetch -c fast -l repl_backup -P -v -h M -p 15210 -U repl -W
>> 8. Copy one recovery.conf.sample from the share folder of the package to
>>    the database folder of host S, then rename this file to recovery.conf.
>> 9. Modify the file recovery.conf on host S as below:
>>      standby_mode = on
>>      primary_conninfo = 'host=M port=15210 user=repl password=123'
>> 10. Modify the file postgresql.conf on host S as follows:
>>      listen_addresses = 'S'
>> 11. Start the server on host S as standby server.
>> 12. Use the command pg_basebackup on host CS to retrieve the database of
>>     the standby host:
>>      pg_basebackup -D /opt/t38917/data -F p -x fetch -c fast -l repl_backup -P -v -h M -p 15210 -U repl -W
>> 13. Modify the file recovery.conf on host CS as below:
>>      standby_mode = on
>>      primary_conninfo = 'host=S port=15210 user=repl password=123'
>> 14. Modify the file postgresql.conf on host CS as follows:
>>      listen_addresses = 'CS'
>> 15. Start the server on host CS as cascaded standby server node.
>> 16. Try to connect a client to host CS, but it gives the error:
>>      FATAL: the database system is starting up
>
> This procedure didn't reproduce the problem in HEAD. But when I restarted
> the master server between steps 11 and 12, I was able to reproduce the
> problem.
>
>> Observations related to bug
>> ------------------------------
>> In the above scenario it is observed that the startup process has read
>> all data (in our defect scenario minRecoveryPoint is 5016220) till the
>> position 5016220, and then it checks for recovery consistency by the
>> following condition in the function CheckRecoveryConsistency:
>>
>>     if (!reachedConsistency &&
>>         XLByteLE(minRecoveryPoint, EndRecPtr) &&
>>         XLogRecPtrIsInvalid(ControlFile->backupStartPoint))
>>
>> At this point the first two conditions are true but the last condition
>> is not, because the redo has not yet been applied and hence
>> backupStartPoint has not been reset. So it does not signal the
>> postmaster about the consistent state. After this it applies the redo,
>> resets backupStartPoint, and then goes to read the next set of records.
>> Since all records have already been read, it starts waiting for a new
>> record from the standby node. But since no new record is coming from the
>> standby node, it keeps waiting and never gets a chance to recheck the
>> recovery consistency level. Hence the client connection is not allowed.
>
> If the cascaded standby starts recovery at a normal checkpoint record,
> this problem will not happen. Because if wal_level is set to hot_standby,
> an XLOG_RUNNING_XACTS WAL record always follows the normal checkpoint
> record. So while the XLOG_RUNNING_XACTS record is being replayed,
> ControlFile->backupStartPoint can be reset, and then the cascaded standby
> can pass the consistency test.
> The problem happens when the cascaded standby starts recovery at a
> shutdown checkpoint record. In this case, no WAL record might follow the
> checkpoint one yet. So, after replaying the shutdown checkpoint record,
> the cascaded standby needs to wait for a new WAL record to appear before
> reaching the code block that resets ControlFile->backupStartPoint. The
> cascaded standby cannot reach a consistent state, and a client cannot
> connect to it, until new WAL has arrived.
>
> The attached patch will fix the problem. In this patch, if recovery
> begins at a shutdown checkpoint record, the ControlFile fields (like
> backupStartPoint) required for checking that end-of-backup is reached
> are not set at first. IOW, the cascaded standby considers the database
> consistent from the beginning. This is safe because a shutdown
> checkpoint record means that there is no running database activity at
> that point and the database is in a consistent state.

Hmm, I think the CheckRecoveryConsistency() call in the redo loop is
misplaced. It's called after we got a record from ReadRecord, but
*before* replaying it (rm_redo). Even if replaying record X makes the
system consistent, we won't check and notice that until we have fetched
record X+1. In this particular test case, record X is a shutdown
checkpoint record, but it could just as well be a running-xacts record,
or the record that reaches minRecoveryPoint.

Does the problem go away if you just move the CheckRecoveryConsistency()
call *after* rm_redo (attached)?

- Heikki