Thread: BDR, ERROR: previous init failed, manual cleanup is required


BDR, ERROR: previous init failed, manual cleanup is required

From
"Zhu, Joshua"
Date:

 

Here is a BDR problem we ran into recently:

 

We have a BDR group with a pair of nodes, N1 and N2. The group is created on N1 and N2 joins it; so far so good.

N2 then departed and rejoined the group a couple of times, after which we ran into an issue. After executing bdr.bdr_group_join() on N2 with respect to N1, we see the following symptoms:

 

a) N2 does not show up in N1’s bdr.bdr_nodes table (which now has only one record, for N1 itself)

b) in N2’s bdr.bdr_nodes table, there is also only one record (for N2 itself), with node_status stuck at i

c) in N2’s log file, the following entries repeat (each time with a different PID):

 

< 2018-02-06 11:37:26.580 PST >LOG:  worker process: bdr db: testdb (PID 526) exited with exit code 1

< 2018-02-06 11:37:31.622 PST >ERROR:  previous init failed, manual cleanup is required

< 2018-02-06 11:37:31.622 PST >DETAIL:  Found bdr.bdr_nodes entry for N2 (6519523095740643074,1,17222,) with state=i in remote bdr.bdr_nodes

< 2018-02-06 11:37:31.622 PST >HINT:  Remove all replication identifiers and slots corresponding to this node from the init target node then drop and recreate this database and try again
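
The DETAIL line carries the identity of the node that BDR treats as half-initialized, which is handy when searching the catalogs on N1 for leftovers. Below is a minimal sketch for pulling it out of the log. It assumes the tuple is laid out as (sysid, timeline, dboid, and a fourth field that is empty here), as the log above suggests. The names `NodeIdentity` and `parse_bdr_node_identity` are hypothetical helpers, not part of BDR.

```python
import re
from typing import NamedTuple, Optional


class NodeIdentity(NamedTuple):
    sysid: str     # kept as text, exactly as printed in the log
    timeline: int
    dboid: int
    state: str     # single-character node state, e.g. "i"


# Assumed layout, taken from the DETAIL line quoted in this thread:
#   ... (sysid,timeline,dboid,<fourth field>) with state=<char> ...
_DETAIL_RE = re.compile(r"\((\d+),(\d+),(\d+),[^)]*\) with state=(\w)")


def parse_bdr_node_identity(line: str) -> Optional[NodeIdentity]:
    """Return the node identity found in a DETAIL line, or None if absent."""
    m = _DETAIL_RE.search(line)
    if m is None:
        return None
    sysid, timeline, dboid, state = m.groups()
    return NodeIdentity(sysid, int(timeline), int(dboid), state)


detail = (
    "< 2018-02-06 11:37:31.622 PST >DETAIL:  Found bdr.bdr_nodes entry "
    "for N2 (6519523095740643074,1,17222,) with state=i in remote bdr.bdr_nodes"
)
print(parse_bdr_node_identity(detail))
```

The sysid is deliberately left as a string so it can be compared against catalog values without any numeric conversion.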

 

The log message suggests that there is a record for N2 in N1’s bdr.bdr_nodes table, but as noted above, that table has only one record (for N1).

Also note that each time before N2 joins, it is thoroughly cleaned up (including bdr.remove_bdr_from_local_node(), dropping and recreating the database, etc.)

 

Now, no matter what I do on N1 to clean up, the problem persists. The following have been tried, to no avail:

 

a) drop replication slots (if found)

b) 1. bdr.remove_bdr_from_local_node;  2. recreate BDR group

c) 1. drop bdr extension;  2. recreate bdr extension;  3. recreate BDR group

d) restart the postgres server after b.1) and c.1)

 

While dropping and recreating the database on N1 would likely fix the problem, doing so is unthinkable in a production environment.

 

Would appreciate thoughts and suggestions,

Thanks

 

Re: BDR, ERROR: previous init failed, manual cleanup is required

From
Dan Wierenga
Date:

On Wed, Feb 7, 2018 at 9:14 AM, Zhu, Joshua <jzhu@vormetric.com> wrote:

 

Here is a BDR problem we ran into recently:

 

A BDR group with a pair of nodes, N1 and N2, group is created on N1, N2 joins the group, so far so good

N2 departs/rejoins the group a couple of times, then ran into an issue, with the following symptom, after executing bdr.bdr_group_join() on N2 wrt N1:


FWIW, I was never able to successfully join a node with bdr.bdr_group_join.   I was only ever able to get it to work by using bdr_init_copy and letting it create the database on the target node for me. Run "SELECT bdr.bdr_node_join_wait_for_ready();" to make sure it bootstrapped properly.

I can't access my bdr cluster right now, but off the top of my head:
- check the bdr.bdr_connections table in addition to the nodes table.  
- make sure you run "select bdr.bdr_connections_changed();" after you manually delete from any of the bdr tables. 
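
The checks suggested above can be collected into a few read-only queries to run on N1. This is a sketch only. The column names `node_sysid` and `conn_sysid` are assumed from the BDR 1.x catalogs and should be verified against the installed version, and matching slots by the sysid appearing in `slot_name` is likewise an assumption about BDR's slot naming. `diagnostic_queries` is a hypothetical helper that only builds SQL text and does not connect to anything.

```python
from typing import List


def diagnostic_queries(sysid: str) -> List[str]:
    """Build read-only queries that look for leftovers of one node."""
    # The sysid is interpolated into SQL text, so accept digits only.
    if not sysid.isdigit():
        raise ValueError("sysid must consist of digits only")
    return [
        # Assumed column name: node_sysid
        f"SELECT * FROM bdr.bdr_nodes WHERE node_sysid = '{sysid}';",
        # Assumed column name: conn_sysid
        f"SELECT * FROM bdr.bdr_connections WHERE conn_sysid = '{sysid}';",
        # pg_replication_slots is a core PostgreSQL view
        "SELECT slot_name, active FROM pg_replication_slots "
        f"WHERE slot_name LIKE '%{sysid}%';",
    ]


# sysid taken from the DETAIL log line earlier in this thread
for query in diagnostic_queries("6519523095740643074"):
    print(query)
```

If any rows are then deleted by hand from the bdr tables, the advice above still applies: run "select bdr.bdr_connections_changed();" afterwards.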

RE: BDR, ERROR: previous init failed, manual cleanup is required

From
"Zhu, Joshua"
Date:

Thanks a lot for your reply…

 

I had already gone the extreme route of dropping and recreating the database before I could try the bdr_connections_changed() call. I’ll keep it in mind for the next time the same issue happens.

 

From: Dan Wierenga [mailto:dwierenga@gmail.com]
Sent: Wednesday, February 07, 2018 1:45 PM
To: Zhu, Joshua <jzhu@thalesesec.net>
Cc: pgsql-general@postgresql.org
Subject: Re: BDR, ERROR: previous init failed, manual cleanup is required

 

 
