Thread: [GENERAL] BDR question on dboid conflicts


[GENERAL] BDR question on dboid conflicts

From
"Zhu, Joshua"
Date:

The database oid is used in both bdr.bdr_nodes (as node_dboid) and bdr.bdr_connections (as conn_dboid), and also in the construction of replication slot names.

 

I noticed that when trying to join a bdr group, the join fails if the database oid on the new node happens to be the same as that of a node already in the group. The only way I have found to resolve the conflict is to drop and recreate the database, retrying until the dboid no longer conflicts with any node already in the group.

 

Is there a better way to handle this kind of conflict, especially from a script?

 

Thanks

Re: [GENERAL] BDR question on dboid conflicts

From
Craig Ringer
Date:
On 27 October 2017 at 01:15, Zhu, Joshua <jzhu@vormetric.com> wrote:
> Database oid is used in both bdr.bdr_nodes, as node_dboid, and
> bdr.bdr_connections, as conn_dboid, also used in construction of replication
> slot names.

Correct. However, it's used in conjunction with the sysid and node timeline ID.

> I noticed that when trying to join a bdr group, if the database oid on the
> new node happens to be the same as that of an node already in the bdr group,
> the join would fail, and the only way to resolve the conflict that I was
> able to come up with has been to retry with dropping/recreating the database
> until the dboid does not conflict with any node already in the group.

That is extremely surprising. In our regression tests the database
oids should be the same quite often, as we do various tests where we
create multiple instances. More importantly, every time you
bdr_init_copy, you get a clone with the same database oid, and that
works fine.

There's no detail here to work from, so I cannot guess what's actually
happening, but I can confidently say it's not a database oid conflict.
Nowhere in BDR should the database oid be considered without the rest
of the (sysid,timeline,dboid) tuple.


-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] BDR question on dboid conflicts

From
"Zhu, Joshua"
Date:
Thanks, sounds like that's something unique in my environment/setup.

Here are the results of bdr.bdr_get_local_nodeid() for four nodes in a group,
Node 1: (6480169638493465053,1,16386)
Node 2: (6480169638493465053,1,20225)
Node 3: (6480169638493465053,1,29164)
Node 4: (6480169638493465053,1,20227)
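
A minimal sketch of how such a check could be scripted before attempting a join, assuming the node IDs above have already been collected as text from each node with bdr.bdr_get_local_nodeid(). The helper names parse_node_id and would_conflict are hypothetical and are not part of BDR.

```python
def parse_node_id(text):
    """Parse '(sysid,timeline,dboid)' as printed above into a tuple of ints."""
    sysid, timeline, dboid = text.strip().strip("()").split(",")
    return int(sysid), int(timeline), int(dboid)


def would_conflict(candidate, existing):
    """True if the candidate's full node ID is already used by a group member."""
    return candidate in set(existing)


# Nodes 1-3 are treated as existing members, Node 4 as the joining node.
existing = [parse_node_id(s) for s in (
    "(6480169638493465053,1,16386)",
    "(6480169638493465053,1,20225)",
    "(6480169638493465053,1,29164)",
)]
candidate = parse_node_id("(6480169638493465053,1,20227)")

# Collect the distinct (sysid, timeline) pairs across all four nodes.
distinct_prefixes = {(sysid, timeline) for sysid, timeline, _ in existing + [candidate]}

print(would_conflict(candidate, existing))
print(len(distinct_prefixes))
```

With the values listed above, Node 4 does not conflict, but all four nodes share a single (sysid, timeline) pair, so in this group the dboid is the only part of the tuple that tells the nodes apart.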

And here is what pg_replication_slots table looks like on Node 4
bdr_20227_6480169638493465053_1_29164__ | bdr    | logical   |  20227 | mydb   | t      |       9603 |      | 7750| 0/2D4E780   | 0/2D4E7B8
bdr_20227_6480169638493465053_1_16386__ | bdr    | logical   |  20227 | mydb   | t      |       9602 |      | 7750| 0/2D4E780   | 0/2D4E7B8
bdr_20227_6480169638493465053_1_20225__ | bdr    | logical   |  20227 | mydb   | t      |       9601 |      | 7750| 0/2D4E780   | 0/2D4E7B8
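
The slot names in this listing can be split into their parts with a short script. This is only a sketch: the field meanings are inferred from the listing itself (the first number matches Node 4's own dboid, and the next three match the other nodes' IDs), and parse_slot_name is a hypothetical helper, not a BDR function.

```python
import re

# Assumed layout, inferred from the listing above:
#   bdr_<local dboid>_<remote sysid>_<remote timeline>_<remote dboid>__<suffix>
SLOT_RE = re.compile(r"^bdr_(\d+)_(\d+)_(\d+)_(\d+)__(.*)$")


def parse_slot_name(name):
    """Split a BDR-style slot name into the local dboid and the remote node ID."""
    match = SLOT_RE.match(name)
    if match is None:
        raise ValueError(f"not a recognised BDR slot name: {name!r}")
    local_dboid, sysid, timeline, remote_dboid, suffix = match.groups()
    return {
        "local_dboid": int(local_dboid),
        "remote": (int(sysid), int(timeline), int(remote_dboid)),
        "suffix": suffix,
    }


slots = [
    "bdr_20227_6480169638493465053_1_29164__",
    "bdr_20227_6480169638493465053_1_16386__",
    "bdr_20227_6480169638493465053_1_20225__",
]
for slot in slots:
    print(parse_slot_name(slot))
```

If the inferred layout is right, two peers with an identical (sysid, timeline, dboid) tuple would map to the same slot name on a given node.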
 

-----Original Message-----
From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Craig Ringer
Sent: Thursday, October 26, 2017 7:24 PM
To: Zhu, Joshua <jzhu@thalesesec.net>
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] BDR question on dboid conflicts

On 27 October 2017 at 01:15, Zhu, Joshua <jzhu@vormetric.com> wrote:
> Database oid is used in both bdr.bdr_nodes, as node_dboid, and 
> bdr.bdr_connections, as conn_dboid, also used in construction of 
> replication slot names.

Correct. However, it's used in conjunction with the sysid and node timeline ID.

> I noticed that when trying to join a bdr group, if the database oid on 
> the new node happens to be the same as that of an node already in the 
> bdr group, the join would fail, and the only way to resolve the 
> conflict that I was able to come up with has been to retry with 
> dropping/recreating the database until the dboid does not conflict with any node already in the group.

That is extremely surprising. In our regression tests the database oids should be the same quite often, as we do
various tests where we create multiple instances. More importantly, every time you bdr_init_copy, you get a clone with
the same database oid, and that works fine.

There's no detail here to work from, so I cannot guess what's actually happening, but I can confidently say it's not a
database oid conflict.

Nowhere in BDR should the database oid be considered without the rest of the (sysid,timeline,dboid) tuple.


--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: BDR question on dboid conflicts

From
higher_ground
Date:
I actually just recently encountered this very same problem when calling
bdr_group_join().  The dboid generated is the same as that of an existing
node, and the tuple (sysid, timeline, dboid) is the same as well.

I saw this manifest two different ways in the logs:

A)
< 2021-03-30 02:42:56.942 UTC >FATAL:  could not send replication command
"CREATE_REPLICATION_SLOT "bdr_16386_6924805489516289687_1_17615__" LOGICAL
bdr": status PGRES_FATAL_ERROR: ERROR:  replication slot
"bdr_16386_6924805489516289687_1_17615__" already exists

B)
< 2021-03-30 21:02:29.260 UTC >LOG:  Creating replica with:  <…>
Restoring dump to <…>
< 2021-03-30 21:02:31.929 UTC >ERROR:  duplicate key value violates unique
constraint "bdr_nodes_pkey"
< 2021-03-30 21:02:31.929 UTC >DETAIL:  Key (node_sysid, node_timeline,
node_dboid)=(6924805489516289687, 1, 17615) already exists.

I did not see this issue previously on an earlier version of the OS we are
using.  The Postgres/BDR version has not changed either.

It seems that (on this platform, for the experiments I’ve tried thus far)
‘17615’ is always generated as the first dboid of an added node, hence the
conflict.  When we remove the node and try again, another dboid is
predictably tried.  In general (except for the addition of the 2nd node,
which is always successful), for the Nth node added, (N-1) tries are always
needed to ensure a unique dboid (and a unique tuple).

At this point it would be great if there were a way to avoid this
programmatically.  It seems that I can only detect this error condition in
the logs.  The bdr_group_join() call itself does not return an error.
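
Until there is a better answer, the two log signatures above can at least be detected from a script. The following is a rough sketch, assuming the server log is readable as text and that the messages look like cases A and B. find_join_conflicts is a hypothetical helper, and the patterns are taken only from the excerpts quoted in this message.

```python
import re

# Patterns based solely on the two log excerpts (A and B) quoted above.
SLOT_EXISTS_RE = re.compile(r'replication slot "(bdr_[^"]+)" already exists')
NODE_KEY_RE = re.compile(
    r"Key \(node_sysid, node_timeline, node_dboid\)="
    r"\((\d+), (\d+), (\d+)\) already exists"
)


def find_join_conflicts(log_text):
    """Return duplicate slot names and duplicate node IDs mentioned in the log."""
    # Collapse whitespace so that wrapped lines still match.
    text = " ".join(log_text.split())
    slots = sorted(set(SLOT_EXISTS_RE.findall(text)))
    node_ids = sorted({tuple(int(x) for x in m) for m in NODE_KEY_RE.findall(text)})
    return {"duplicate_slots": slots, "duplicate_node_ids": node_ids}


sample = '''
< 2021-03-30 02:42:56.942 UTC >FATAL:  could not send replication command
"CREATE_REPLICATION_SLOT "bdr_16386_6924805489516289687_1_17615__" LOGICAL
bdr": status PGRES_FATAL_ERROR: ERROR:  replication slot
"bdr_16386_6924805489516289687_1_17615__" already exists
< 2021-03-30 21:02:31.929 UTC >DETAIL:  Key (node_sysid, node_timeline,
node_dboid)=(6924805489516289687, 1, 17615) already exists.
'''
print(find_join_conflicts(sample))
```

This only reports a conflict after the join has already failed, so it is a detection aid rather than a way to prevent the duplicate tuple.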

Is there a way to make the sysid, timeline, or dboid unique?

Thank you very much for your help.




--
Sent from: https://www.postgresql-archive.org/PostgreSQL-general-f1843780.html