postmaster.pid still exists after pacemaker stopped postgresql - how to remove
Hi all.
I’m running 2 node pacemaker/pgsql cluster in one technical center TC1 and the second 2 node pacemaker/pgsql cluster in the TC2.
After the streaming replication has been established between TC1 (master) and TC2 (slave) I’ve tried to migrate resources within TC1 from node 1 to node 2.
The Pacemaker operation FAILED to stop the postgres resource. PostgreSQL itself was stopped, but postmaster.pid was left behind in a corrupted state. This situation did not occur before streaming replication was established.
Why did this happen?
How can I delete the corrupted pid file? The standard way of deleting a file, "rm -f", does not work.
It looks like this:
[root@tstcaps01 data]# ll
ls: cannot access postmaster.pid: No such file or directory
total 56
drwx------ 7 postgres postgres 62 Jun 26 17:13 base
drwx------ 2 postgres postgres 4096 Aug 17 23:38 global
drwx------ 2 postgres postgres 17 Jun 26 09:54 pg_clog
-rw------- 1 postgres postgres 5127 Aug 17 16:24 pg_hba.conf
-rw------- 1 postgres postgres 1636 Jun 26 09:54 pg_ident.conf
drwx------ 2 postgres postgres 4096 Jul 2 00:00 pg_log
drwx------ 4 postgres postgres 34 Jun 26 09:53 pg_multixact
drwx------ 2 postgres postgres 17 Aug 17 23:38 pg_notify
drwx------ 2 postgres postgres 6 Jun 26 09:53 pg_serial
drwx------ 2 postgres postgres 6 Jun 26 09:53 pg_snapshots
drwx------ 2 postgres postgres 6 Aug 17 23:38 pg_stat_tmp
drwx------ 2 postgres postgres 17 Jun 26 09:54 pg_subtrans
drwx------ 2 postgres postgres 6 Jun 26 09:53 pg_tblspc
drwx------ 2 postgres postgres 6 Jun 26 09:53 pg_twophase
-rw------- 1 postgres postgres 4 Jun 26 09:53 PG_VERSION
drwx------ 3 postgres postgres 4096 Aug 17 23:38 pg_xlog
-rw------- 1 postgres postgres 19884 Aug 17 22:54 postgresql.conf
-rw------- 1 postgres postgres 71 Aug 17 23:38 postmaster.opts
?????????? ? ? ? ? ? postmaster.pid
-rw-r--r-- 1 postgres postgres 491 Aug 17 16:33 recovery.done
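The "?" columns in the listing above mean stat() on the directory entry is failing, which is also why "rm" by name reports "No such file or directory". On a healthy filesystem a stubborn entry can sometimes be removed by its inode number instead of its name; here is a minimal sketch, demonstrated on a scratch directory (all paths are illustrative, not from this server). When stat() itself fails, as in the listing above, this approach will usually fail too, and an offline fsck of the unmounted volume is the realistic next step.

```shell
# Demonstrate removing a directory entry by inode number rather than by
# name, in a throwaway directory (illustrative only).
workdir=$(mktemp -d)
touch "$workdir/postmaster.pid"

# Look up the inode number of the entry (first column of `ls -i`).
inode=$(ls -i "$workdir/postmaster.pid" | awk '{print $1}')

# Ask find to match on that inode and delete the entry.
find "$workdir" -inum "$inode" -delete

# The directory should now be empty.
remaining=$(ls -A "$workdir")
rmdir "$workdir"
```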
Some additional information…
[root@tstcaps01 data]# crm_mon -1
============
Last updated: Sun Aug 18 00:08:04 2013
Last change: Sat Aug 17 23:26:19 2013 via crm_resource on tstcaps01
Stack: openais
Current DC: tstcaps02 - partition WITHOUT quorum
Version: 1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14
2 Nodes configured, 4 expected votes
6 Resources configured.
============
Online: [ tstcaps01 tstcaps02 ]
Resource Group: PGServer
pg_lvm (ocf::heartbeat:LVM): Started tstcaps01
pg_fs (ocf::heartbeat:Filesystem): Started tstcaps01
pg_lsb (lsb:postgresql-9.2): Started tstcaps01 (unmanaged) FAILED
pg_vip (ocf::heartbeat:IPaddr2): Stopped
Master/Slave Set: ms_drbd_pg [drbd_pg]
Masters: [ tstcaps01 ]
Slaves: [ tstcaps02 ]
Failed actions:
pg_lsb_stop_0 (node=tstcaps01, call=90, rc=-2, status=Timed Out): unknown exec error
[root@tstcaps01 data]# rpm -qa | grep postgres
postgresql92-9.2.4-1PGDG.rhel6.x86_64
postgresql-libs-8.4.11-1.el6_2.x86_64
postgresql92-libs-9.2.4-1PGDG.rhel6.x86_64
postgresql92-server-9.2.4-1PGDG.rhel6.x86_64
postgresql92-devel-9.2.4-1PGDG.rhel6.x86_64
postgresql92-contrib-9.2.4-1PGDG.rhel6.x86_64
Any help would be appreciated.
Best regards,
Michal Mistina
Re: postmaster.pid still exists after pacemaker stopped postgresql - how to remove
Hi there.
I didn't find out why this issue happened. Only backing up and reformatting the filesystem on which the corrupted postmaster.pid file lived got rid of it. Hopefully the file won't reappear in the future.
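Before resorting to a full reformat, an offline filesystem check is often enough to clear an orphaned or corrupted directory entry. A rough sketch, assuming the PGDATA filesystem lives on the DRBD device /dev/drbd0 and is mounted at /var/lib/pgsql/9.2/data (both names are assumptions, not taken from this thread):

```shell
# Stop the resource group so nothing holds the mount, then check the
# filesystem offline. Destructive-adjacent: run only on an unmounted volume.
crm resource stop PGServer

umount /var/lib/pgsql/9.2/data      # assumed mount point
fsck -f /dev/drbd0                  # forced check repairs bad dir entries
mount /dev/drbd0 /var/lib/pgsql/9.2/data

crm resource start PGServer
```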
Best regards,
Michal Mistina
From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Mistina Michal
Sent: Sunday, August 18, 2013 12:11 AM
To: pgsql-general@postgresql.org
Subject: [GENERAL] postmaster.pid still exists after pacemaker stopped postgresql - how to remove
On Mon, Aug 26, 2013 at 9:53 PM, Mistina Michal <Michal.Mistina@virte.sk> wrote:
> Hi there.
>
> I didn't find out why this issue happened. Only backup and format of the
> filesystem where corrupted postmaster.pid file existed helped to get rid of
> it. Hopefully the file won't appear in the future.

I have encountered a similar problem when I broke the filesystem by a double mount. You may have run into the same problem.

> Master/Slave Set: ms_drbd_pg [drbd_pg]
> Masters: [ tstcaps01 ]
> Slaves: [ tstcaps02 ]

Why do you use DRBD together with streaming replication? If you locate the database cluster on DRBD, it's better to check the status of the DRBD filesystem.

Regards,

--
Fujii Masao
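The DRBD status check suggested above can be sketched like this (DRBD 8.x commands; the resource name "pg" and device /dev/drbd0 are assumptions, not from this thread):

```shell
# Overall DRBD state: connection state, roles, and disk states per minor.
cat /proc/drbd

# Per-resource queries (DRBD 8.x syntax): role, connection, disk state.
drbdadm role pg
drbdadm cstate pg
drbdadm dstate pg

# Guard against a double mount: the device should appear at most once.
mount | grep drbd0
```

A healthy primary should show Connected and UpToDate/UpToDate; StandAlone usually indicates a split brain that DRBD refused to auto-resolve.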
Re: Re: postmaster.pid still exists after pacemaker stopped postgresql - how to remove
Hi Masao.
Thank you for the suggestion. Indeed, that could have occurred, most probably while I was testing a split-brain situation. In that test I turned off the network card on one node, and DRBD ended up in the primary role on both nodes. After the split brain occurred I resynced DRBD: of the two primaries I promoted one as "primary" (winner) and demoted the other to "secondary" (victim). The data should have been consistent at that point, but apparently it wasn't.

I am using DRBD only within one technical center. Data are synced by streaming replication to the secondary technical center, where there is another DRBD instance.

It's like this:

TC1:
--- node1: DRBD (primary), pgsql
--- node2: DRBD (secondary), pgsql

TC2:
--- node1: DRBD (primary), pgsql
--- node2: DRBD (secondary), pgsql

Within one technical center, pgsql runs on only one node at a time; this is handled by pacemaker/corosync. From the outside it looks as if only one PostgreSQL server is running in each TC.

TC1 (master) ==== streaming replication =====> TC2 (slave)

If one node in a technical center fails, the fail-over to the secondary node is really quick, thanks to the fast network within the technical center. Between TC1 and TC2 there is a WAN link. If something goes wrong and TC1 becomes unavailable, I can switch to TC2 manually or automatically.

Is there a more appropriate solution? Would you use something else?

Best regards,
Michal Mistina
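The winner/victim resync described above corresponds to DRBD's manual split-brain recovery procedure, roughly (DRBD 8.x commands; the resource name "pg" is an assumption):

```shell
# On the victim node: step down to secondary and reconnect while
# discarding its local modifications in favor of the winner's data.
drbdadm secondary pg
drbdadm connect --discard-my-data pg

# On the winner node: if it dropped to StandAlone after the split brain,
# reconnect it so the resync can start.
drbdadm connect pg

# Wait for the resync to finish before letting Pacemaker promote
# anything; /proc/drbd should show Connected and UpToDate/UpToDate.
cat /proc/drbd
```

If the filesystem on the winner was itself damaged (e.g. by a double mount), the resync only replicates that damage to the peer, which would be consistent with the corrupted postmaster.pid surviving the recovery.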
On Mon, Aug 26, 2013 at 11:02 PM, Mistina Michal <Michal.Mistina@virte.sk> wrote:
> Is there more appropriate solution? Would you use something else?

Nope. I've heard of a similar configuration, though it uses a shared-disk failover solution instead of DRBD.

Regards,

--
Fujii Masao