Обсуждение: Fwd: Postgres HA - pacemaker RA do not support auto failback

Поиск
Список
Период
Сортировка

Fwd: Postgres HA - pacemaker RA do not support auto failback

От
Shital A
Дата:


Hello,

Postgres version : 9.6
OS:Rhel 7.6

We are working on HA setup for postgres cluster of two nodes in active-passive mode. 

Installed: 
Pacemaker 1.1.19
Corosync 2.4.3

The pacemaker agent with this installation doesn't support automatic failback. What I mean by that is explained below:
1. Cluster is setup like A - B with A as master. 
2. Kill services on A, node B will come up as master.
3. node A is ready to join the cluster, we have to delete the lock file it creates on any one of the node and execute the cleanup command to get the node back as standby

Step 3 is manual so HA is not achieved in real sense. 

Please help to check:
1. Is there any version of the resouce agent which supports automatic failback? To avoid generation of lock file and deleting it.

2. If there is no such support, what checks should be added in pgsql RA to achieve 

Please suggest.
Thanks.

Re: Fwd: Postgres HA - pacemaker RA do not support auto failback

От
Adrien Nayrat
Дата:
On 8/12/19 2:57 PM, Shital A wrote:
>
>
> Hello,
>
> Postgres version : 9.6
> OS:Rhel 7.6
>
> We are working on HA setup for postgres cluster of two nodes in active-passive
> mode. 
>
> Installed: 
> Pacemaker 1.1.19
> Corosync 2.4.3
>
> The pacemaker agent with this installation doesn't support automatic failback.
> What I mean by that is explained below:
> 1. Cluster is setup like A - B with A as master. 
> 2. Kill services on A, node B will come up as master.
> 3. node A is ready to join the cluster, we have to delete the lock file it
> creates on any one of the node and execute the cleanup command to get the node
> back as standby
>
> Step 3 is manual so HA is not achieved in real sense. 
>
> Please help to check:
> 1. Is there any version of the resouce agent which supports automatic failback?
> To avoid generation of lock file and deleting it.
>
> 2. If there is no such support, what checks should be added in pgsql RA to achieve 
>
> Please suggest.
> Thanks.
>

Hello,

Which RA did you use? AFAIK there is two RA :
- pgsql : https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/pgsql
- PAF : http://clusterlabs.github.io/PAF/ (Which is different from pgsql RA :
http://clusterlabs.github.io/PAF/FAQ.html#why-new-ra-for-postgresql)

If I am right, PAF should handle the case when there no need to perform pgrewind
(if the old master is not too advanced in the transaction log).

Regards,

--
Adrien NAYRAT
https://blog.anayrat.info


Вложения

Re: Fwd: Postgres HA - pacemaker RA do not support auto failback

От
Shital A
Дата:


On Mon, 12 Aug 2019, 18:54 Adrien Nayrat, <adrien.nayrat@anayrat.info> wrote:
On 8/12/19 2:57 PM, Shital A wrote:
>
>
> Hello,
>
> Postgres version : 9.6
> OS:Rhel 7.6
>
> We are working on HA setup for postgres cluster of two nodes in active-passive
> mode. 
>
> Installed: 
> Pacemaker 1.1.19
> Corosync 2.4.3
>
> The pacemaker agent with this installation doesn't support automatic failback.
> What I mean by that is explained below:
> 1. Cluster is setup like A - B with A as master. 
> 2. Kill services on A, node B will come up as master.
> 3. node A is ready to join the cluster, we have to delete the lock file it
> creates on any one of the node and execute the cleanup command to get the node
> back as standby
>
> Step 3 is manual so HA is not achieved in real sense. 
>
> Please help to check:
> 1. Is there any version of the resouce agent which supports automatic failback?
> To avoid generation of lock file and deleting it.
>
> 2. If there is no such support, what checks should be added in pgsql RA to achieve 
>
> Please suggest.
> Thanks.
>

Hello,

Which RA did you use? AFAIK there is two RA :
- pgsql : https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/pgsql
- PAF : http://clusterlabs.github.io/PAF/ (Which is different from pgsql RA :
http://clusterlabs.github.io/PAF/FAQ.html#why-new-ra-for-postgresql)

If I am right, PAF should handle the case when there no need to perform pgrewind
(if the old master is not too advanced in the transaction log).

Regards,

--
Adrien NAYRAT
https://blog.anayrat.info



Hello

We are using the pgsql RA installed with pacemaker. 

Is PAF recommended over pgsql?

Has anyone changed pgsql to handle the lock file and other cases that might not have been handled? 

Please advise.

Thanks.

Re: Fwd: Postgres HA - pacemaker RA do not support auto failback

От
Shital A
Дата:


On Tue, 13 Aug 2019, 11:50 Shital A, <brightuser2019@gmail.com> wrote:


On Mon, 12 Aug 2019, 18:54 Adrien Nayrat, <adrien.nayrat@anayrat.info> wrote:
On 8/12/19 2:57 PM, Shital A wrote:
>
>
> Hello,
>
> Postgres version : 9.6
> OS:Rhel 7.6
>
> We are working on HA setup for postgres cluster of two nodes in active-passive
> mode. 
>
> Installed: 
> Pacemaker 1.1.19
> Corosync 2.4.3
>
> The pacemaker agent with this installation doesn't support automatic failback.
> What I mean by that is explained below:
> 1. Cluster is setup like A - B with A as master. 
> 2. Kill services on A, node B will come up as master.
> 3. node A is ready to join the cluster, we have to delete the lock file it
> creates on any one of the node and execute the cleanup command to get the node
> back as standby
>
> Step 3 is manual so HA is not achieved in real sense. 
>
> Please help to check:
> 1. Is there any version of the resouce agent which supports automatic failback?
> To avoid generation of lock file and deleting it.
>
> 2. If there is no such support, what checks should be added in pgsql RA to achieve 
>
> Please suggest.
> Thanks.
>

Hello,

Which RA did you use? AFAIK there is two RA :
- pgsql : https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/pgsql
- PAF : http://clusterlabs.github.io/PAF/ (Which is different from pgsql RA :
http://clusterlabs.github.io/PAF/FAQ.html#why-new-ra-for-postgresql)

If I am right, PAF should handle the case when there no need to perform pgrewind
(if the old master is not too advanced in the transaction log).

Regards,

--
Adrien NAYRAT
https://blog.anayrat.info



Hello

We are using the pgsql RA installed with pacemaker. 

Is PAF recommended over pgsql?

Has anyone changed pgsql to handle the lock file and other cases that might not have been handled? 

Please advise.

Thanks.


Looks like the previous email bounced, received undelivered message. Sending again. We are using the pgsql RA installed with pacemaker and seems like it prevents failback by creating a lock file. We want to implement auto failback and need pointers on how to implement. 

Thanks.


Re: Postgres HA - pacemaker RA do not support auto failback

От
"Jehan-Guillaume (ioguix) de Rorthais"
Дата:
On Tue, 13 Aug 2019 16:38:44 +0530
Shital A <brightuser2019@gmail.com> wrote:

> On Tue, 13 Aug 2019, 11:50 Shital A, <brightuser2019@gmail.com> wrote:
> 
> > On Mon, 12 Aug 2019, 18:54 Adrien Nayrat, <adrien.nayrat@anayrat.info>
> > wrote:
> >  
> >> On 8/12/19 2:57 PM, Shital A wrote:  
> >> > Postgres version : 9.6
> >> > OS:Rhel 7.6
> >> >
> >> > We are working on HA setup for postgres cluster of two nodes in  
> >> > active-passive mode.
> >> >
> >> > Installed:
> >> > Pacemaker 1.1.19
> >> > Corosync 2.4.3
> >> >
> >> > The pacemaker agent with this installation doesn't support automatic  
> >> > failback.  
> >> > What I mean by that is explained below:
> >> > 1. Cluster is setup like A - B with A as master.
> >> > 2. Kill services on A, node B will come up as master.
> >> > 3. node A is ready to join the cluster, we have to delete the lock file  
> >> > it creates on any one of the node and execute the cleanup command to
> >> > get the node back as standby
> >> >
> >> > Step 3 is manual so HA is not achieved in real sense.
> >> >
> >> > Please help to check:
> >> > 1. Is there any version of the resouce agent which supports automatic  
> >> > failback?  
> >> > To avoid generation of lock file and deleting it.
> >> >
> >> > 2. If there is no such support, what checks should be added in pgsql RA  
> >> > to achieve 
> >>
> >> Which RA did you use? AFAIK there is two RA :
> >> - pgsql :
> >> https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/pgsql
> >> - PAF : http://clusterlabs.github.io/PAF/ (Which is different from pgsql
> >> RA :
> >> http://clusterlabs.github.io/PAF/FAQ.html#why-new-ra-for-postgresql)
> >>
> >> If I am right, PAF should handle the case when there no need to perform
> >> pgrewind
> >> (if the old master is not too advanced in the transaction log).

PAF only make sure controlled failover (aka switchover) performs safely so the
old master hook back as a standby correctly.

Should a failure occurs with an automatic failover, PAF has no dark magic to
automagicaly failoback the failing master as a standby. 

> > We are using the pgsql RA installed with pacemaker.
> >
> > Is PAF recommended over pgsql?

Yes, as much as the historical RA.

> > Has anyone changed pgsql to handle the lock file and other cases that
> > might not have been handled?

I don't know. However, I do think it would be either presumptuous to climb this
road today.

> Looks like the previous email bounced, received undelivered message.
> Sending again. We are using the pgsql RA installed with pacemaker and seems
> like it prevents failback by creating a lock file. We want to implement
> auto failback and need pointers on how to implement.

The lock file is here because PgSQL has no safety belt against data corruption
when an old master hook on a new one while its LSN was further away from
the timeline fork. You must take care of this by yourself.

Moreover, should a failover occurs, you should probably check WHY it occurred
and fix things before performing a blindly failback of a potential wounded
instance. A wounded instance can hurt your cluster in many different and
inconfortable ways.

Last, as Adrien wrote, you can manually use pg_rewind to revert the old master
to the timeline fork, as far as you enabled checksums or 'wal_log_hints=on' and
kept enough WAL around.

Another option is just to make sure you are able to build a new standby as was
fast as possible.

But remember: if your data is important, use quorum, fencing and watchdog.

Good luck.