Thread: WAL segment not replicated

WAL segment not replicated

From
Ted EH
Date:
I have a 2 node active/standby setup, with synchronous streaming enabled.

WAL segments are replicated as expected on the standby.

However, if I manually kill the postgres process with pkill on the primary, I end up with the standby's WAL behind that of the primary. The primary seems to increment its WAL automatically after I restart it.

Example: 
Before killing the pg process:
The primary and standby appear to be replicating synchronously:

 pg_last_wal_receive_lsn
-------------------------
 0/5E0108B0
(1 row)

 pg_current_wal_lsn
--------------------
 0/5E0108B0
(1 row)

Now I run "pkill postgres". On the primary, the WAL directory has
00000001000000000000005F as the latest segment file (it should be 5E, but it was unexpectedly incremented),
while the standby has
00000001000000000000005E as its latest segment.
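
For reference, the relationship between the LSNs and the segment file names above can be sketched as follows. This is a minimal illustration, assuming the default 16 MB WAL segment size and timeline 1 (both match the file names shown here); the function names are made up for the example. On a live server, the built-in pg_walfile_name() function does the same mapping.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumption: default 16 MB segments


def parse_lsn(lsn: str) -> int:
    """Convert an LSN such as '0/5E0108B0' into a 64-bit byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def wal_segment_name(lsn: str, timeline: int = 1,
                     segment_size: int = WAL_SEGMENT_SIZE) -> str:
    """Return the name of the WAL segment file that contains the given LSN."""
    segment_number = parse_lsn(lsn) // segment_size
    # Number of segments covered by one 4 GB "xlog id" (256 for 16 MB segments).
    segments_per_xlogid = 0x100000000 // segment_size
    return "%08X%08X%08X" % (
        timeline,
        segment_number // segments_per_xlogid,
        segment_number % segments_per_xlogid,
    )


print(wal_segment_name("0/5E0108B0"))  # 00000001000000000000005E
print(wal_segment_name("0/5F000000"))  # 00000001000000000000005F
```

So an LSN of 0/5E0108B0 lives in segment ...5E, and any position at or beyond 0/5F000000 belongs to segment ...5F.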

The problem is that if I want to restart the primary as a standby (swapping the roles), it complains that it is requesting WAL too far in the future, which is not available on the new primary (the old standby):

could not receive data from WAL stream: ERROR:  requested starting point 0/5F000000 is ahead of the WAL flush position of this server 0/5E0CBF38

requested starting point 0/5F000000 is ahead of the WAL flush position of this server 0/5E0D3200
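
To make the size of the mismatch in these messages concrete, here is a small sketch that subtracts the two LSNs from the first error. It only does arithmetic on the values quoted above; the helper names are illustrative, not part of PostgreSQL.

```python
def parse_lsn(lsn: str) -> int:
    """Convert an LSN such as '0/5F000000' into a 64-bit byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def wal_gap_bytes(requested_start: str, flush_position: str) -> int:
    """Bytes by which the requested start point is ahead of the flush position.

    A positive result means the requesting node asks for WAL that the
    serving node has not written yet.
    """
    return parse_lsn(requested_start) - parse_lsn(flush_position)


gap = wal_gap_bytes("0/5F000000", "0/5E0CBF38")
print(gap)  # 15941832
```

The old primary is asking to start streaming roughly 15 MB past the point the new primary has flushed, that is, from the very start of segment ...5F.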

Isn't the primary (the original primary) expected to know how far along its standby is?

Doing a base backup recovery is not an option for me at this point.

This is "pg_ctl (PostgreSQL) 10.2"

Re: WAL segment not replicated

From
Ian Barwick
Date:
On 03/01/2018 11:43 AM, Ted EH wrote:
 > I have a 2 node active/standby setup, with synchronous streaming enabled.
 >
 > WAL segments are replicated as expected on the standby.
 >
 > However, if I manually kill the postgres process with pkill on the primary

Why are you using pkill (and how)?

 > I'm ending up with a standby WAL behind that of the primary. The primary
 > seems to be incrementing its WAL automatically after I restart it.
 >
 > Example:
 > Before killing the pg process:
 > The primary and standby appear to be replicating synchronously:
 >
 >   pg_last_wal_receive_lsn
 > -------------------------
 >   0/5E0108B0
 > (1 row)
 >
 >   pg_current_wal_lsn
 > --------------------
 >   0/5E0108B0
 > (1 row)
 >
 > Now I do "pkill postgres", on the primary the WAL dir has
 > 00000001000000000000005F as the latest segment (file)  (supposed to be 5E, but unexpectedly getting incremented)
 > while the standby has:
 > 00000001000000000000005E as its latest segment
 >
 > The problem is if I want to restart the primary as a standby (swapping the roles),
 > it will complain about asking for a WAL too far in the future that is not available
 > on the new primary (old standby):
 >
 > could not receive data from WAL stream: ERROR:  requested starting point 0/5F000000 is ahead of the WAL flush position of this server 0/5E0CBF38
 >
 > requested starting point 0/5F000000 is ahead of the WAL flush position of this server 0/5E0D3200
 >
 > Isn't the primary (original primary) expected to know how far is its standby?

Nope. Once you promote the standby, it's an independent node; what happens
on the original primary is of no consequence.

If the primary was shut down cleanly *before* the standby was promoted,
you should be able to reattach it as a standby.

However it sounds like you've forced an unclean shutdown of your (former)
primary without everything being flushed to the standby. After restarting
the (former) primary it's probably gone into automatic recovery, replaying
some WAL not flushed to the standby, thus creating a divergence from your
standby, which can't be resolved without manual intervention.

 > Doing a base backup recovery is not an option for me at this point.
 >
 > This is "pg_ctl (PostgreSQL) 10.2"

pg_rewind could potentially help:

   https://www.postgresql.org/docs/current/static/app-pgrewind.html
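
A rough sketch of how such a pg_rewind call could be assembled is shown below. The data directory and connection string are placeholders, not values from this thread. It also assumes the usual pg_rewind prerequisites hold: the target (the former primary) has been shut down cleanly, and it was running with wal_log_hints = on or with data checksums enabled. The sketch only builds and prints the command with --dry-run; it does not execute anything.

```python
import shlex


def build_pg_rewind_command(target_pgdata: str, source_conninfo: str,
                            dry_run: bool = True) -> list:
    """Build the argument list for rewinding the former primary.

    target_pgdata: data directory of the node to rewind (the former primary).
    source_conninfo: libpq connection string for the new primary.
    """
    command = [
        "pg_rewind",
        "--target-pgdata=" + target_pgdata,
        "--source-server=" + source_conninfo,
        "--progress",
    ]
    if dry_run:
        # Report what would be done without modifying the target.
        command.append("--dry-run")
    return command


# Placeholder values; adjust to the actual installation.
cmd = build_pg_rewind_command(
    "/path/to/old/primary/data",
    "host=new-primary.example port=5432 user=postgres dbname=postgres",
)
print(" ".join(shlex.quote(part) for part in cmd))
```

Running it first with --dry-run shows whether pg_rewind can find a common point between the two nodes before anything on the former primary is changed.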


Regards

Ian Barwick


-- 
  Ian Barwick                   http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


Re: WAL segment not replicated

From
Ted EH
Date:
My reason for killing the process is to test behavior in the event of an unclean shutdown.

In the test, I made sure the former primary was demoted to a standby.

To keep things simple, I repeated the test without restarting the former primary and without promoting the standby.

Before killing the main pg server process, both the primary and the standby have 00064 as the latest segment under pg_wal.

This time all I did was "sudo pkill postmaster", after which, under pg_wal, the latest segment on the standby is still 00064, while on the primary it is now 00065.

This suggests that some process continues to run on the primary and creates the next WAL segment (...65). Is this expected, and if so, why?