Discussion: [ADMIN] San replication corrupting postgres file...

[ADMIN] San replication corrupting postgres file...

From: Rahul Sharma
Date:
Hi Team,

I am facing an issue with Postgres replication between my primary and DR sites. My setup is as follows:

1. I am trying to replicate an LVM-level snapshot on a SAN which does block-level replication.
2. OS details: RHEL 7.1, kernel 3.10
3. Postgres version: 9.6

The steps performed:

1. Stop all the containers running on the OS.
2. Stop the SAN level replication.
3. Switch over to the replicated site.
4. Start the containers
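
Before the SAN replication is stopped in step 2, it may be worth confirming that Postgres itself finished a clean shutdown, since stopping a container does not guarantee that. The sketch below reads the cluster state from the output of pg_controldata. It is only an illustration under stated assumptions: the function names are hypothetical, pg_controldata must be on PATH, and the output is assumed to be in English (LC_ALL=C is forced for that reason).

```python
import os
import subprocess


def cluster_state(controldata_output: str) -> str:
    """Return the value of the 'Database cluster state' line."""
    for line in controldata_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Database cluster state":
            return value.strip()
    raise ValueError("no 'Database cluster state' line found")


def is_cleanly_shut_down(data_dir: str) -> bool:
    """Run pg_controldata on data_dir and check for a clean shutdown."""
    # Force untranslated messages so the key and value match the strings below.
    env = dict(os.environ, LC_ALL="C")
    result = subprocess.run(
        ["pg_controldata", data_dir],
        check=True,
        capture_output=True,
        text=True,
        env=env,
    )
    return cluster_state(result.stdout) == "shut down"
```

If is_cleanly_shut_down() returns False for the data directory at the moment the replica is cut, the copy is a crash image rather than a cold copy, and startup on the DR side will depend on WAL being intact and consistent with the data files.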

Here the postgres container fails with the error below, which looks like data corruption.

========

LOG:  database system was interrupted; last known up at 2017-04-28 15:58:45 UTC
LOG:  invalid magic number 7270 in log segment 000000010000000000000001, offset 0
LOG:  invalid primary checkpoint record
LOG:  invalid magic number 7270 in log segment 000000010000000000000001, offset 0
LOG:  invalid secondary checkpoint record
PANIC:  could not locate a valid checkpoint record
LOG:  startup process (PID 18) was terminated by signal 6: Aborted
LOG:  aborting startup due to startup process failure
LOG:  database system is shut down

=======

I have tried a graceful shutdown of the microservices, but the replication still fails. Strangely, I have another instance of Postgres (9.4.1) which runs absolutely fine. Could someone please provide some advice?

Thanks
Rahul
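
The "invalid magic number 7270" lines in the log above refer to the 16-bit magic value at the very start of a WAL page header, which the server prints in hexadecimal. A minimal sketch for inspecting that value directly on the replicated volume follows. The function names are hypothetical, and little-endian byte order is an assumption that holds for x86 hosts. The expected value for 9.6 is, to the best of my knowledge, 0xD093, but that should be verified against XLOG_PAGE_MAGIC in the PostgreSQL sources for the exact build in use.

```python
import struct


def wal_page_magic(first_bytes: bytes, byteorder: str = "<") -> int:
    """Decode the 16-bit magic stored at offset 0 of a WAL page."""
    if len(first_bytes) < 2:
        raise ValueError("need at least 2 bytes")
    # "H" is an unsigned 16-bit integer; "<" selects little-endian.
    (magic,) = struct.unpack(byteorder + "H", first_bytes[:2])
    return magic


def read_segment_magic(path: str) -> int:
    """Read the magic from the first page of a WAL segment file."""
    with open(path, "rb") as f:
        return wal_page_magic(f.read(2))


def format_magic(magic: int) -> str:
    """Format the magic the way it appears in the server log."""
    return format(magic, "04X")
```

For what it is worth, on a little-endian host the logged value 7270 corresponds to the bytes 0x70 0x72 at the start of the segment, which are the ASCII characters "pr". That may hint that the file does not begin with a WAL header at all, though this is only a guess; comparing the segment on the primary with the one on the DR side would tell more.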

Re: [ADMIN] San replication corrupting postgres file...

From: Scott Marlowe
Date:
On Mon, May 1, 2017 at 1:32 PM, Rahul Sharma <rahulsharma0525@gmail.com> wrote:
> Hi Team,
>
> I am facing an issue with postgres replication between my primary and DR
> site. I have the following setup,
>
> 1. I am trying to replicate an LVM-level snapshot on a SAN which does
> block-level replication.
> 2. OS details: RHEL 7.1, kernel 3.10
> 3. Postgres version: 9.6
>
> The steps performed:
>
> 1. Stop all the containers running on the OS.
> 2. Stop the SAN level replication.
> 3. Switch over to the replicated site.
> 4. Start the containers
>
> Here the postgres container fails with the error below, which looks like data
> corruption.
>
> ========
>
> LOG:  database system was interrupted; last known up at 2017-04-28 15:58:45
> UTC
> LOG:  invalid magic number 7270 in log segment 000000010000000000000001,
> offset 0
> LOG:  invalid primary checkpoint record
> LOG:  invalid magic number 7270 in log segment 000000010000000000000001,
> offset 0
> LOG:  invalid secondary checkpoint record
> PANIC:  could not locate a valid checkpoint record
> LOG:  startup process (PID 18) was terminated by signal 6: Aborted
> LOG:  aborting startup due to startup process failure
> LOG:  database system is shut down
>
> =======
>
> I have tried a graceful shutdown of the microservices, but the
> replication still fails. Strangely, I have another instance of Postgres
> (9.4.1) which runs absolutely fine. Could someone please provide some
> advice?

Are your pg_xlog and data directories on different volumes? If so, then
vm snapshots are likely to not be coherent due to timing etc.

Is there a reason you're NOT using pgsql's built-in streaming replication?
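
The first question can be checked mechanically. A minimal sketch, assuming a 9.6-style layout in which WAL lives in pg_xlog under the data directory (possibly as a symlink to another location); the function name is hypothetical. It compares the device IDs of the two directories, which is a heuristic: separate LVM logical volumes normally show different device IDs, but the check says nothing about whether the SAN snapshots them at the same instant.

```python
import os


def wal_on_separate_volume(data_dir: str, wal_subdir: str = "pg_xlog") -> bool:
    """Return True if the WAL directory is on a different device than data_dir."""
    # realpath() follows symlinks, so a pg_xlog symlink is resolved to its target.
    data_path = os.path.realpath(data_dir)
    wal_path = os.path.realpath(os.path.join(data_dir, wal_subdir))
    return os.stat(wal_path).st_dev != os.stat(data_path).st_dev
```

If this returns True for the Postgres container's data directory, the data files and the WAL are being captured from different volumes, and the snapshot is only usable when both are taken atomically together. Tablespaces under pg_tblspc raise the same concern and could be checked the same way.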


Re: [ADMIN] San replication corrupting postgres file...

From: Rahul Sharma
Date:
Hi Scott,

My architecture is as follows

I have a primary server with its own LVM, with the data directories pointing to its own SAN. On the DR end we have a similar setup, with its own LVM pointing to its own directory structure and its own SAN. The replication happens between the primary and DR SANs.

The reason we opted for this architecture is that we are using multiple database types, and to maintain data integrity between them we take LVM-level snapshots.

Thanks
Rahul

On Mon, May 1, 2017 at 2:39 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> Are your pg_xlog and data directories on different volumes? If so, then
> vm snapshots are likely to not be coherent due to timing etc.
>
> Is there a reason you're NOT using pgsql's built-in streaming replication?

Re: [ADMIN] San replication corrupting postgres file...

From: Scott Marlowe
Date:
On Mon, May 1, 2017 at 2:20 PM, Rahul Sharma <rahulsharma0525@gmail.com> wrote:
> Hi Scott,
>
> My architecture is as follows
>
> I have a primary server and its own LVM  with the data directories pointing
> to its own SAN. On the DR end we have a similar set up with its own LVM
> pointing to its own directory structure and pointing to its own SAN . The
> replication happens between primary and DR SAN.

This doesn't really answer the question I asked: "Are your pg_xlog and
data directories on different volumes?"


Re: [ADMIN] San replication corrupting postgres file...

From: Steven Chang
Date:
Hello,

Let me clarify one thing: what do you mean by "postgres replication"?
Is it the WAL archiving or streaming replication mechanism that Postgres provides natively, or just a file system snapshot?
Which one are you implementing?

Steven
