Discussion: [ADMIN] San replication corrupting postgres file...
Hi Team,
I am facing an issue with postgres replication between my primary and DR site. I have the following setup:
1. I am trying to replicate an LVM-level snapshot on a SAN, which does block-level replication.
2. OS details: RHEL 7.1, kernel 3.10
3. Postgres version: 9.6
The steps performed:
1. Stop all the containers running on the OS.
2. Stop the SAN level replication.
3. Switch over to the replicated site.
4. Start the containers
Here the postgres container fails with the error below, which looks like data corruption.
========
LOG: database system was interrupted; last known up at 2017-04-28 15:58:45 UTC
LOG: invalid magic number 7270 in log segment 000000010000000000000001, offset 0
LOG: invalid primary checkpoint record
LOG: invalid magic number 7270 in log segment 000000010000000000000001, offset 0
LOG: invalid secondary checkpoint record
PANIC: could not locate a valid checkpoint record
LOG: startup process (PID 18) was terminated by signal 6: Aborted
LOG: aborting startup due to startup process failure
LOG: database system is shut down
=======
I have tried a graceful shutdown of the microservices, but the replication still fails. The strange thing is that I have another instance of postgres (9.4.1) which runs absolutely fine. Could someone please provide some advice?
Thanks
Rahul
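The "invalid magic number" lines in the log refer to the 16-bit value stored at the very start of the named WAL segment. As a minimal sketch for inspecting that value on the replicated copy, the following reads the first two bytes of a segment file and prints them the way the log does. It assumes a little-endian host such as x86_64; the function names are hypothetical, and the path to the segment must be supplied by the reader.

```python
import struct
import sys


def read_wal_page_magic(segment_path):
    """Return the 16-bit value stored at offset 0 of a WAL segment file."""
    with open(segment_path, "rb") as f:
        raw = f.read(2)
    if len(raw) < 2:
        raise ValueError("segment file is shorter than 2 bytes")
    # Assumption: little-endian host, so the uint16 is stored low byte first.
    (magic,) = struct.unpack("<H", raw)
    return magic


def format_magic(magic):
    """Format the value as four hex digits, matching the log message style."""
    return "%04X" % magic


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage: python check_magic.py <path to WAL segment file>
    print(format_magic(read_wal_page_magic(sys.argv[1])))
```

Running it against the same segment on the primary and on the DR copy shows whether the two sides hold the same bytes at offset 0; a difference would suggest the replicated block is stale or was never written.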
On Mon, May 1, 2017 at 1:32 PM, Rahul Sharma <rahulsharma0525@gmail.com> wrote:
> I am facing an issue with postgres replication between my primary and DR
> site.
>
> I have tried the graceful shutdown of the microservices but still the
> replication fails. Strange issues id i have other instance of postgres
> (9.4.1 )which runs absolutely fine. Could someone please provide some
> advice?

Are your pg xlog and data directories on different volumes? If so, then VM snapshots are likely not to be coherent due to timing etc.

Is there a reason you're NOT using pgsql's built-in streaming replication?
Hi Scott,
My architecture is as follows:
I have a primary server with its own LVM, with the data directories pointing to its own SAN. On the DR end we have a similar setup, with its own LVM pointing to its own directory structure and its own SAN. The replication happens between the primary and DR SANs.
The reason we opted for this architecture is that we are using multiple database types, and to maintain data integrity between them we take LVM-level snapshots.
Thanks
Rahul
On Mon, May 1, 2017 at 2:39 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> Are your pg xlog and data directories on different volumes? If so then
> vm snapshots are likely to not be coherent due to timing etc.
>
> Is there a reason you're NOT using pgsql's built in streaming replication?
On Mon, May 1, 2017 at 2:20 PM, Rahul Sharma <rahulsharma0525@gmail.com> wrote:
> Hi Scott,
>
> My architecture is as follows
>
> I have a primary server and its own LVM with the data directories pointing
> to its own SAN. On the DR end we have a similar set up with its on LVM
> pointing to its own directory structure and pointing to its own SAN . The
> replication happens between primary and DR SAN.

This doesn't really answer the question I asked: "Are your pg xlog and data directories on different volumes?"
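One way to answer that question directly is to compare the device IDs of the data directory and its WAL directory. The sketch below does this with `os.stat`, resolving symlinks first. It assumes the WAL directory is the `pg_xlog` subdirectory of the data directory, as in 9.6, and it ignores tablespaces; the function name and the data directory path passed in are hypothetical.

```python
import os
import sys


def same_volume(data_dir, xlog_subdir="pg_xlog"):
    """Return (True/False, resolved WAL path) for whether both live on one device."""
    # pg_xlog may be a symlink to another mount, so resolve it before stat().
    xlog_dir = os.path.realpath(os.path.join(data_dir, xlog_subdir))
    data_dev = os.stat(os.path.realpath(data_dir)).st_dev
    xlog_dev = os.stat(xlog_dir).st_dev
    return data_dev == xlog_dev, xlog_dir


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage: python same_volume.py <path to the postgres data directory>
    same, resolved = same_volume(sys.argv[1])
    print("pg_xlog resolves to:", resolved)
    print("same volume as data directory:", same)
```

If the result is `False`, the two directories sit on separate block devices, so snapshots of them are only consistent with each other if they are taken atomically as a group.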
Hello,
Let me clarify one thing.
What do you mean by "postgres replication"?
The WAL archiving or streaming replication mechanism that postgres provides natively,
or just a file system snapshot?
Which one are you implementing?
Steven