Pacemaker dynamic membership

From: Nikolay Popov
Subject: Pacemaker dynamic membership
Msg-id: 5614CD4A.7020900@postgrespro.ru
List: pgsql-admin
Hello.

We have been looking into ways of using the Corosync/Pacemaker stack to build a
high-availability cluster of PostgreSQL servers with automatic failover.

We are using Corosync (2.3.4) as the messaging layer and the stateful master/slave
resource agent (pgsql) with Pacemaker (1.1.12) on CentOS 7.1.

Things work pretty well for a static cluster, where membership is defined up front.
However, we need to be able to seamlessly add new nodes to the cluster and remove
existing ones, without service interruption. And here we ran into a problem.

Is it possible to add a new node dynamically without interruption?


Here are the steps we are using to add a node:


# pcs property show
Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: mycluster1
 dc-version: 1.1.13-a14efad
 have-watchdog: false
 last-lrm-refresh: 1444042099
 no-quorum-policy: stop
 stonith-action: reboot
 stonith-enabled: true
Node Attributes:
 pi01: pgsql-data-status=STREAMING|SYNC
 pi02: pgsql-data-status=STREAMING|POTENTIAL
 pi03: pgsql-data-status=LATEST

# pcs resource show --full
 Group: master-group
  Resource: vip-master (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: ip=192.168.242.100 nic=eth0 cidr_netmask=24
   Operations: start interval=0s timeout=60s on-fail=restart (vip-master-start-interval-0s)
               monitor interval=10s timeout=60s on-fail=restart (vip-master-monitor-interval-10s)
               stop interval=0s timeout=60s on-fail=block (vip-master-stop-interval-0s)
  Resource: vip-rep (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: ip=192.168.242.101 nic=eth0 cidr_netmask=24
   Meta Attrs: migration-threshold=0
   Operations: start interval=0s timeout=60s on-fail=stop (vip-rep-start-interval-0s)
               monitor interval=10s timeout=60s on-fail=restart (vip-rep-monitor-interval-10s)
               stop interval=0s timeout=60s on-fail=ignore (vip-rep-stop-interval-0s)
 Master: msPostgresql
  Meta Attrs: master-max=1 master-node-max=1 clone-max=3 clone-node-max=1 notify=true
  Resource: pgsql (class=ocf provider=heartbeat type=pgsql)
   Attributes: pgctl=/usr/pgsql-9.5/bin/pg_ctl psql=/usr/pgsql-9.5/bin/psql pgdata=/var/lib/pgsql/9.5/data/ rep_mode=sync node_list="pi01 pi02 pi03" restore_command="cp /var/lib/pgsql/9.5/data/wal_archive/%f %p" primary_conninfo_opt="user=repl password=super-pass-for-repl keepalives_idle=60 keepalives_interval=5 keepalives_count=5" master_ip=192.168.242.100 restart_on_promote=true check_wal_receiver=true
   Operations: start interval=0s timeout=60s on-fail=restart (pgsql-start-interval-0s)
               monitor interval=4s timeout=60s on-fail=restart (pgsql-monitor-interval-4s)
               monitor role=Master timeout=60s on-fail=restart interval=3s (pgsql-monitor-interval-3s-role-Master)
               promote interval=0s timeout=60s on-fail=restart (pgsql-promote-interval-0s)
               demote interval=0s timeout=60s on-fail=stop (pgsql-demote-interval-0s)
               stop interval=0s timeout=60s on-fail=block (pgsql-stop-interval-0s)
               notify interval=0s timeout=60s (pgsql-notify-interval-0s)
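
Note that the intended membership is effectively encoded in two places here: clone-max on
msPostgresql and node_list on the pgsql primitive, and both have to be changed when a node
is added. The current value of either can be read back, e.g. with crm_resource (shipped
with Pacemaker):

# crm_resource --resource pgsql --get-parameter node_list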

# pcs cluster auth pi01 pi02 pi03 pi05 -u hacluster -p hacluster
pi01: Authorized
pi02: Authorized
pi03: Authorized
pi05: Authorized

# pcs cluster node add pi05 --start
pi01: Corosync updated
pi02: Corosync updated
pi03: Corosync updated
pi05: Succeeded
pi05: Starting Cluster...
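
At this point pi05 is a full corosync/pacemaker member, but it runs no resources yet,
since clone-max is still 3 and pi05 is not in node_list. Membership itself can be
double-checked on any node with corosync's own tooling:

# corosync-cmapctl | grep -i members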

# crm_mon -Afr1                                                                       
Last updated: Fri Oct  2 16:59:54 2015          Last change: Fri Oct  2 16:59:23 2015 by hacluster via crmd on pi02
Stack: corosync
Current DC: pi02 (version 1.1.13-a14efad) - partition with quorum
4 nodes and 8 resources configured

Online: [ pi01 pi02 pi03 pi05 ]

Full list of resources:

 Resource Group: master-group
     vip-master (ocf::heartbeat:IPaddr2):       Started pi02
     vip-rep    (ocf::heartbeat:IPaddr2):       Started pi02
 Master/Slave Set: msPostgresql [pgsql]
     Masters: [ pi02 ]
     Slaves: [ pi01 pi03 ]
 fence-pi01     (stonith:fence_ssh):    Started pi02
 fence-pi02     (stonith:fence_ssh):    Started pi01
 fence-pi03     (stonith:fence_ssh):    Started pi01

Node Attributes:
* Node pi01:
    + master-pgsql                      : 100
    + pgsql-data-status                 : STREAMING|SYNC
    + pgsql-receiver-status             : normal
    + pgsql-status                      : HS:sync
* Node pi02:
    + master-pgsql                      : 1000
    + pgsql-data-status                 : LATEST
    + pgsql-master-baseline             : 0000000008000098
    + pgsql-receiver-status             : ERROR
    + pgsql-status                      : PRI
* Node pi03:
    + master-pgsql                      : -INFINITY
    + pgsql-data-status                 : STREAMING|POTENTIAL
    + pgsql-receiver-status             : normal
    + pgsql-status                      : HS:potential
* Node pi05:

Migration Summary:
* Node pi01:
* Node pi03:
* Node pi02:
* Node pi05:

# pcs resource update msPostgresql pgsql master-max=1 master-node-max=1 clone-max=4 clone-node-max=1 notify=true
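
(Since these are meta attributes of the master resource, and clone-max is the only value
actually changing, the same edit can presumably also be made with pcs resource meta:

# pcs resource meta msPostgresql clone-max=4
)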

# crm_mon -Afr1                                               
Last updated: Fri Oct  2 17:04:36 2015          Last change: Fri Oct  2 17:04:07 2015 by root via cibadmin on pi01
Stack: corosync
Current DC: pi02 (version 1.1.13-a14efad) - partition with quorum
4 nodes and 9 resources configured

Online: [ pi01 pi02 pi03 pi05 ]

Full list of resources:

 Resource Group: master-group
     vip-master (ocf::heartbeat:IPaddr2):       Started pi02
     vip-rep    (ocf::heartbeat:IPaddr2):       Started pi02
 Master/Slave Set: msPostgresql [pgsql]
     Masters: [ pi02 ]
     Slaves: [ pi01 pi03 ]
     Stopped: [ pi05 ]
 fence-pi01     (stonith:fence_ssh):    Started pi02
 fence-pi02     (stonith:fence_ssh):    Started pi01
 fence-pi03     (stonith:fence_ssh):    Started pi01

Node Attributes:
* Node pi01:
    + master-pgsql                      : 100
    + pgsql-data-status                 : STREAMING|SYNC
    + pgsql-receiver-status             : normal
    + pgsql-status                      : HS:sync
* Node pi02:
    + master-pgsql                      : 1000
    + pgsql-data-status                 : LATEST
    + pgsql-master-baseline             : 0000000008000098
    + pgsql-receiver-status             : ERROR
    + pgsql-status                      : PRI
* Node pi03:
    + master-pgsql                      : -INFINITY
    + pgsql-data-status                 : STREAMING|POTENTIAL
    + pgsql-receiver-status             : normal
    + pgsql-status                      : HS:potential
* Node pi05:
    + master-pgsql                      : -INFINITY
    + pgsql-status                      : STOP

Migration Summary:
* Node pi01:
* Node pi03:
* Node pi02:
* Node pi05:
   pgsql: migration-threshold=1 fail-count=1000000 last-failure='Fri Oct  2 17:04:13 2015'

Failed Actions:
* pgsql_start_0 on pi05 'unknown error' (1): call=27, status=complete, exitreason='My data may be inconsistent. You have to remove /var/lib/pgsql/tmp/PGSQL.lock file to force start.',
    last-rc-change='Fri Oct  2 17:04:10 2015', queued=0ms, exec=2553ms
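
As we understand the pgsql RA, this lock file is its guard against starting with possibly
stale data after an unclean stop. Since pi05 is a brand-new node with no usable data anyway,
the obvious fix is to resync it from the master over the replication VIP and clear the
failure; a sketch, assuming the repl password is available via .pgpass on pi05 (the RA
writes recovery.conf itself when it starts the node as a standby):

# rm /var/lib/pgsql/tmp/PGSQL.lock
# rm -rf /var/lib/pgsql/9.5/data
# su - postgres -c "pg_basebackup -h 192.168.242.101 -U repl -D /var/lib/pgsql/9.5/data -X stream -P"
# pcs resource cleanup pgsql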

# pcs resource update pgsql pgsql node_list="pi01 pi02 pi03 pi05"


And here we fall into trouble: pgsql-status is now STOP on every node! As far as we can tell, node_list is an ordinary instance attribute of the pgsql primitive, so changing it makes Pacemaker restart every instance of msPostgresql, which also takes down the colocated master-group (the VIPs).


# crm_mon -Afr1
Last updated: Fri Oct  2 17:07:05 2015          Last change: Fri Oct  2 17:06:37 2015 by root via cibadmin on pi01
Stack: corosync
Current DC: pi02 (version 1.1.13-a14efad) - partition with quorum
4 nodes and 9 resources configured

Online: [ pi01 pi02 pi03 pi05 ]

Full list of resources:

 Resource Group: master-group
     vip-master (ocf::heartbeat:IPaddr2):       Stopped
     vip-rep    (ocf::heartbeat:IPaddr2):       Stopped
 Master/Slave Set: msPostgresql [pgsql]
     Slaves: [ pi02 ]
     Stopped: [ pi01 pi03 pi05 ]
 fence-pi01     (stonith:fence_ssh):    Started pi02
 fence-pi02     (stonith:fence_ssh):    Started pi01
 fence-pi03     (stonith:fence_ssh):    Started pi01

Node Attributes:
* Node pi01:
    + master-pgsql                      : -INFINITY
    + pgsql-data-status                 : STREAMING|SYNC
    + pgsql-status                      : STOP
* Node pi02:
    + master-pgsql                      : -INFINITY
    + pgsql-data-status                 : LATEST
    + pgsql-status                      : STOP
* Node pi03:
    + master-pgsql                      : -INFINITY
    + pgsql-data-status                 : STREAMING|POTENTIAL
    + pgsql-status                      : STOP
* Node pi05:
    + master-pgsql                      : -INFINITY
    + pgsql-status                      : STOP

Migration Summary:
* Node pi01:
* Node pi03:
* Node pi02:
* Node pi05:


Do you know a way to add a new node to the cluster without this disruption? Is there some command or procedure for this?
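
For what it's worth, pcs can stage several changes in a CIB file and push them in a single
operation, so the clone-max and node_list updates would at least land in one transition,
and the resulting transition can be previewed with crm_simulate before pushing. Whether
that actually avoids restarting the existing instances is exactly what we don't know:

# pcs cluster cib /tmp/cib.xml
# pcs -f /tmp/cib.xml resource update msPostgresql master-max=1 master-node-max=1 clone-max=4 clone-node-max=1 notify=true
# pcs -f /tmp/cib.xml resource update pgsql node_list="pi01 pi02 pi03 pi05"
# crm_simulate --xml-file /tmp/cib.xml --simulate
# pcs cluster cib-push /tmp/cib.xml
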
-- 
Nikolay Popov
n.popov@postgrespro.ru
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
