Discussion: repmgr won't update witness after failover
https://github.com/2ndQuadrant/repmgr/blob/master/FAILOVER.rst
Hey,

I have set up three nodes of PostgreSQL 9.4 with repmgr in this way:

1. master - node1
2. standby - node2
3. witness - node3

I have set up the replication and the witness as described here:
https://github.com/2ndQuadrant/repmgr/blob/master/FAILOVER.rst

Now when I do 'kill -9 $(pidof postmaster)', the witness detects that something went wrong and fails over from node1 to node2.

But when I then set up the replication to work from node2 to node1 and kill the postgresql process, it doesn't fail over, and the repmgrd log shows the following message:

unable to determine a valid master server; waiting 10 seconds to retry...

It seems that the witness doesn't know about the new standby server.

Has anyone got any idea what I am doing wrong here?

Best regards,
Aviel Buskila
From: Aviel Buskila <aviel33@gmail.com>
Date: 2015-08-13 15:43 GMT+03:00
Subject: Re: [GENERAL] repmgr won't update witness after failover
To: Jony Cohen <jony.cohenjo@gmail.com>
Hi Aviel,

You can use the 'cluster show' command to see the repmgr state before you do the 2nd failover - make sure that node1 is indeed marked as a replica.

After a failover the old master doesn't automatically attach to the new master - you need to point it at the new master as a slave (standby follow - if possible...).

Did you start repmgrd on node1 after making it a replica of the new master? (It needs 2 daemons to decide what to promote.)

Regards,
- Jony
Hi,

Did you make the old master follow the new one using repmgr? It doesn't update itself automatically...

From the looks of it, repmgr thinks you have 2 masters - the old one offline and the new one online.

Regards,
Jony
Hey,

I have just tried to start repmgrd on the new standby after I fixed it up as a standby, and it still goes the same way.

From the message in the repmgrd log on the witness server, it seems that it is not able to elect a new master because it can't see anyone.

I have checked the repl_nodes table on the witness and it shows:

witness node3
master node2
master node1

Is there a way to update the witness after the first failover?

2015-08-13 15:06 GMT+03:00 Jony Cohen <jony.cohenjo@gmail.com>:
> did you start the repmgrd on node1 after making it a replica of the new master? (it needs 2 daemons to decide what to promote)
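The repl_nodes listing above records two masters, which matches Jony's reading of the situation. A minimal Python sketch of that sanity check follows. It works only on (type, name) pairs copied by hand from the table; the function name `check_single_master` and the "after re-registering" state are assumptions made for illustration, and this is not how repmgr itself validates its metadata.

```python
def check_single_master(nodes):
    """Report whether exactly one node is recorded as master.

    `nodes` is a list of (type, name) pairs, e.g. copied by hand from
    the repl_nodes table. This is an illustrative check only.
    """
    masters = sorted(name for node_type, name in nodes if node_type == "master")
    if len(masters) == 1:
        return "ok: master is " + masters[0]
    if not masters:
        return "problem: no node is recorded as master"
    return "problem: multiple masters recorded: " + ", ".join(masters)


# State reported from the witness after the first failover.
after_first_failover = [
    ("witness", "node3"),
    ("master", "node2"),
    ("master", "node1"),
]

# Assumed target state once node1 is recorded as a standby again.
after_reregistering = [
    ("witness", "node3"),
    ("master", "node2"),
    ("standby", "node1"),
]

if __name__ == "__main__":
    print(check_single_master(after_first_failover))
    # prints: problem: multiple masters recorded: node1, node2
    print(check_single_master(after_reregistering))
    # prints: ok: master is node2
```

Running the same check before the second failover test would show whether the old master's record has been corrected.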
On 14/08/15 at 04:14, Aviel Buskila wrote:
> Hey,
> yes I did .. and still it won't fail back..

Can you send over the output of "repmgr cluster show" before and after the failover process?

Also the output of SELECT * FROM repmgr_schema.repl_nodes; after the failover (you need to replace repmgr_schema with what you have configured).

Also, which version of repmgr are you running?

Regards,
--
Martín Marqués http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hey,
I have set up the configuration all over again; the contents of 'repl_nodes' before the failover are now:
 id |  type   | upstream_node_id |   cluster    | name  |                    conninfo                    | priority | active
----+---------+------------------+--------------+-------+------------------------------------------------+----------+--------
  1 | master  |                  | cluster_name | node1 | host=node1 dbname=repmgr port=5432 user=repmgr |      100 | t
  2 | standby |                1 | cluster_name | node2 | host=node2 dbname=repmgr port=5432 user=repmgr |      100 | t
  3 | witness |                  | cluster_name | node3 | host=node3 dbname=repmgr port=5499 user=repmgr |      100 | t
repmgrd is started on node2 and node3 (standby and witness). Now when I kill the postgres master process, I can see the following messages in the repmgrd log:
[WARNING] connection to master has been lost, trying to recover... 60 seconds before failover decision
[WARNING] connection to master has been lost, trying to recover... 50 seconds before failover decision
[WARNING] connection to master has been lost, trying to recover... 40 seconds before failover decision
[WARNING] connection to master has been lost, trying to recover... 30 seconds before failover decision
[WARNING] connection to master has been lost, trying to recover... 20 seconds before failover decision
[WARNING] connection to master has been lost, trying to recover... 10 seconds before failover decision
and then, when it tries to elect node2 for promotion, it shows the following messages:
[DEBUG] connecting to: 'host=node2 user=repmgr dbname=repmgr fallback_application_name='repmgr''
[WARNING] unable to determine a valid master server; waiting 10 seconds to retry...
[ERROR] unable to determine a valid master node, terminating...
[INFO] repmgrd terminating..
What am I doing wrong?
Hey,

I think I know what the problem is. After the first failover, when I clone the old master to be a standby with the 'repmgr standby clone' command, it seems that nothing updates the repl_nodes table with the new standby in my cluster, so on the next failover repmgrd fails to find a standby to fail over to.

I confirmed this by manually updating the repl_nodes table after the clone so that the old master is now recorded as a standby database.

Now my question is: where is it supposed to happen that, after I issue 'repmgr standby clone', repl_nodes is also updated with the new standby server?

Best regards,
Aviel Buskila

2015-08-16 12:11 GMT+03:00 Aviel Buskila <aviel33@gmail.com>:
> what am I doing wrong?
Hey,
Thanks for the reply, this helped me very much.
Kind Regards,
Aviel Buskila.
Hi,

The clone command just clones the data from node2 to node1; you also need to register it with the `force` option to override the old record (as if you were building a new replica node...).

see:

Regards,
- Jony

On Sun, Aug 16, 2015 at 3:19 PM, Aviel Buskila <aviel33@gmail.com> wrote:
> Where is it supposed to happen that, after I issue 'repmgr standby clone', repl_nodes is also updated with the new standby server?
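The re-registration step described in this reply could be scripted roughly as follows. This is only a sketch under stated assumptions: the helper name `rejoin_commands` and the configuration path are hypothetical, `-f` is taken to select the repmgr configuration file, and `--force` is the option mentioned in the reply. The function only builds the command lines to run on the re-cloned old master; it does not execute anything, and the exact options should be checked against the installed repmgr version.

```python
import shlex


def rejoin_commands(conf_path):
    """Build the repmgr invocations for a re-cloned old master.

    Returns a list of argv lists; nothing is executed here.
    `conf_path` is the node's repmgr configuration file (hypothetical path).
    """
    base = ["repmgr", "-f", conf_path]
    return [
        # Overwrite the stale "master" record for this node in repl_nodes.
        base + ["--force", "standby", "register"],
        # Show the cluster state so the result can be checked by eye.
        base + ["cluster", "show"],
    ]


if __name__ == "__main__":
    for argv in rejoin_commands("/etc/repmgr/repmgr.conf"):
        # shlex.quote keeps the printed line safe to paste into a shell.
        print(" ".join(shlex.quote(arg) for arg in argv))
```

Keeping the commands as data makes it easy to review them before running them by hand on node1.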