
upgrade to repmgr3

From:
Pekka Rinne
Date:
hi!

I have been using a PostgreSQL 9.4 and repmgr 2.0 combination with streaming replication (hot standby). Now I'd like to switch to slot-based replication, so I have installed repmgr 3, executed the provided SQL scripts, and added use_replication_slots=1 to repmgr.conf.

What is the correct procedure for switching to slots (max_replication_slots) when the system has already been set up without them? Do I have to unregister and re-register all the standbys? Hopefully a re-clone can be avoided.

Here is what I tried: I configured max_replication_slots=5, restarted the master, created some slots using select * from pg_create_physical_replication_slot(<name>), and configured one of the created slots in recovery.conf on the slave. Replication still seemed to work after this, but in the repl_nodes table slot_name remained empty. Then I re-registered the standby with --force, and slot_name was filled with the value repmgr_slot_2, which is not the name I gave the slot. I think repmgr invented this name, but repmgr_slot_2 does not exist in the pg_replication_slots table; there is only the slot I created myself (active=t). So I guess this approach is not quite right.
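For context, pointing a standby at a pre-created slot is done in recovery.conf roughly like this (the slot name here is an assumed example, not the one actually used above):

```
# recovery.conf on the standby -- slot name is an example
standby_mode = 'on'
primary_conninfo = 'host=<master IP> user=repmgr'
primary_slot_name = 'my_standby_slot'
```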

What if I just skip the re-registration? Does it matter if slot_name remains empty in repl_nodes?

br,
Pekka

Re: upgrade to repmgr3

From:
Ian Barwick
Date:
Hi

On 08/04/2016 05:57 PM, Pekka Rinne wrote:
> hi!
>
> I have been using postgres 9.4 and repmgr2.0 combination and been doing
> replication (hot standby). Now I'd like to start doing slot based
> replication and have installed repmgr3 and executed the provided SQL scripts
> and also added use_replication_slots=1 into repmgr.conf.
>
> The question is that what is the correct procedure to switch into using
> slots (max_replication_slots) in this case as the system has been set up
> already without them? Do I have to unregister and re-register all the
> standbys? Hopefully re-clone could be avoided.

No re-clone needed.

> What I tried was that I configured max_replication_slots=5, restarted
> master, created some slots using select * from
> pg_create_physical_replication_slot(<name>), configured one created slot
> into recovery.conf in the slave. What I noticed was that replication seemed
> to be still working after this but in repl_nodes table slot_name remained
> empty. Then I did standby re-register with force and slot_name was filled
> with repmgr_slot_2 value which is not the name I gave for the slot. I think
> repmgr invented this name but in the pg_replication_slots table
> repmgr_slot_2 does not exist. There is only the slot I created myself
> (active=t). So I guess this approach is not quite right.
>
> What if I just skip doing the re-register. Does it matter if slot_name
> remains empty in repl_nodes?

This won't affect replication; however, if you attempt any failover actions
using repmgr (e.g. following a new master), it will probably cause problems
when repmgr attempts to create a replication slot on the new master.

As a workaround you can manually update the slot name in the repl_nodes table
to match the one you've chosen. We'll update repmgr to better handle this
kind of situation. I don't think we've had this particular use-case before,
so I'll add some notes to the documentation on how best to handle it.
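Assuming the repmgr schema is named as in the logs later in this thread (repmgr_<cluster>, i.e. "repmgr_sensorlog"), the manual workaround might look like this; the node id and slot name are illustrative examples, not values from the actual cluster:

```sql
-- Hypothetical sketch: make repl_nodes reflect the slot actually in use.
UPDATE "repmgr_sensorlog".repl_nodes
   SET slot_name = 'my_standby_slot'  -- the slot you created yourself
 WHERE id = 2;                        -- the standby's node id

-- Verify against the slots that actually exist on the master:
SELECT slot_name, active FROM pg_replication_slots;
```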


Regards

Ian Barwick

--
  Ian Barwick                   http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


Re: upgrade to repmgr3

From:
Pekka Rinne
Date:
hello

Thanks for your comments; they are very helpful. If you have any draft-level documentation of the upgrade procedure available, I would be very interested in seeing it and maybe trying it out as well.

Meanwhile I did some more testing in my environment with repmgr 3 and noticed an issue with promoting a standby node. Here is roughly what I did:

1. Installed the repmgr 3.1.2 RPM on all nodes as an upgrade to the previous 2.0.2.
2. Took the repmgr release-upgrade SQL scripts from the GitHub 3.1-stable branch and ran them on the master (all three scripts, in order).
3. Stopped the postgresql service on the master.
4. Ran standby promote on the standby, which does some things and then hangs forever.

This standby promote was working fine before the repmgr upgrade.

There is a COMMIT command visible in the ps output:

3324 ?        Ss     0:00 postgres: repmgr repmgr <new master IP>(43666) COMMIT waiting for 2/4E000548

What I see in the console is here:

[2016-08-08 10:29:03] [NOTICE] using configuration file "/var/lib/pgsql/repmgr/repmgr.conf"
[2016-08-08 10:29:03] [INFO] connecting to standby database
[2016-08-08 10:29:03] [DEBUG] connecting to: 'host=<new master IP> user=repmgr dbname=repmgr fallback_application_name='repmgr''
[2016-08-08 10:29:03] [INFO] connected to standby, checking its state
[2016-08-08 10:29:03] [DEBUG] is_standby(): SELECT pg_catalog.pg_is_in_recovery()
[2016-08-08 10:29:03] [INFO] retrieving node list for cluster 'sensorlog'
[2016-08-08 10:29:03] [DEBUG] get_master_connection():
  SELECT id, conninfo,          CASE WHEN type = 'master' THEN 1 ELSE 2 END AS type_priority    FROM "repmgr_sensorlog".repl_nodes    WHERE cluster = 'sensorlog'      AND type != 'witness' ORDER BY active DESC, type_priority, priority, id
[2016-08-08 10:29:03] [INFO] checking role of cluster node '1'
[2016-08-08 10:29:03] [DEBUG] connecting to: 'host=<old master IP> user=repmgr dbname=repmgr fallback_application_name='repmgr''
[2016-08-08 10:29:03] [ERROR] connection to database failed: could not connect to server: Connection refused
        Is the server running on host "<old master IP>" and accepting
        TCP/IP connections on port 5432?
[2016-08-08 10:29:03] [INFO] checking role of cluster node '2'
[2016-08-08 10:29:03] [DEBUG] connecting to: 'host=<new master IP> user=repmgr dbname=repmgr fallback_application_name='repmgr''
[2016-08-08 10:29:03] [DEBUG] is_standby(): SELECT pg_catalog.pg_is_in_recovery()
[2016-08-08 10:29:03] [INFO] checking role of cluster node '3'
[2016-08-08 10:29:03] [DEBUG] connecting to: 'host=<old master IP> user=repmgr dbname=repmgr fallback_application_name='repmgr''
[2016-08-08 10:29:03] [DEBUG] is_standby(): SELECT pg_catalog.pg_is_in_recovery()
[2016-08-08 10:29:03] [NOTICE] promoting standby
[2016-08-08 10:29:03] [DEBUG] get_pg_setting(): SELECT name, setting  FROM pg_settings WHERE name = 'data_directory'
[2016-08-08 10:29:03] [DEBUG] get_pg_setting(): returned value is "/var/lib/pgsql/data"
[2016-08-08 10:29:03] [NOTICE] promoting server using '/usr/bin/pg_ctl -D /var/lib/pgsql/data promote'
server promoting
[2016-08-08 10:29:03] [INFO] reconnecting to promoted server
[2016-08-08 10:29:03] [DEBUG] connecting to: 'host=<new master IP> user=repmgr dbname=repmgr fallback_application_name='repmgr''
[2016-08-08 10:29:03] [DEBUG] is_standby(): SELECT pg_catalog.pg_is_in_recovery()
[2016-08-08 10:29:05] [DEBUG] is_standby(): SELECT pg_catalog.pg_is_in_recovery()
[2016-08-08 10:29:05] [DEBUG] setting node 2 as master and marking existing master as failed
[2016-08-08 10:29:05] [DEBUG] begin_transaction()
[2016-08-08 10:29:05] [DEBUG] commit_transaction()

The system is left in a strange state after this. If I start postgresql again on the old master node and run cluster show, it lists both nodes as masters.

In this experiment I did not enable slot-based replication at all. Everything was left just as it was before, except for the repmgr upgrade.

br,
Pekka



Re: upgrade to repmgr3

From:
Martín Marqués
Date:
Hi,

On 08/08/16 at 05:57, Pekka Rinne wrote:
>
> Meanwhile I did some more testing with my environment using repmgr3 and
> noticed an issue with promoting standby node. Here is roughly what I did.
>
> 1. Install repmgr3.1.2 RPM to all nodes as upgrade to previous 2.0.2.
> 2. I took repmgr release upgrade SQL-scripts from github 3.1 stable and
> ran those on master (all three scripts in order).
> 3. on master I stopped postgresql service
> 4. on standby I said standby promote which does some things and then
> hangs forever.
>
> This standby promote was working fine before repmgr upgrade.
>
> There is a COMMIT command visible with ps:
>
> 3324 ?        Ss     0:00 postgres: repmgr repmgr <new master IP>(43666)
> COMMIT waiting for 2/4E000548

You mean it doesn't finish executing and return you to the prompt?

Do you by chance have synchronous replication set up? That ps output alone
doesn't say much, but being stuck on COMMIT normally points to a failure
to sync the replication to a standby.
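A quick way to check this (a generic diagnostic, not specific to repmgr) is to look at the synchronous-replication settings and the replication view on the node where the COMMIT is stuck:

```sql
-- Is synchronous replication configured at all?
SHOW synchronous_standby_names;
SHOW synchronous_commit;

-- Which standbys, if any, are currently connected, and are any synchronous?
SELECT application_name, state, sync_state FROM pg_stat_replication;
```

If synchronous_standby_names is non-empty but pg_stat_replication shows no row with sync_state = 'sync', commits that write data will wait indefinitely.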

> [2016-08-08 10:29:03] [DEBUG] get_pg_setting(): returned value is
> "/var/lib/pgsql/data"
> [2016-08-08 10:29:03] [NOTICE] promoting server using '/usr/bin/pg_ctl
> -D /var/lib/pgsql/data promote'
> server promoting

Here, it runs the promote command.

> [2016-08-08 10:29:03] [INFO] reconnecting to promoted server
> [2016-08-08 10:29:03] [DEBUG] connecting to: 'host=<new master IP>
> user=repmgr dbname=repmgr fallback_application_name='repmgr''
> [2016-08-08 10:29:03] [DEBUG] is_standby(): SELECT
> pg_catalog.pg_is_in_recovery()
> [2016-08-08 10:29:05] [DEBUG] is_standby(): SELECT
> pg_catalog.pg_is_in_recovery()
> [2016-08-08 10:29:05] [DEBUG] setting node 2 as master and marking
> existing master as failed

At this point, the promoted standby is a primary server (master), and
repmgr then updates the nodes table to reflect that:

> [2016-08-08 10:29:05] [DEBUG] begin_transaction()
> [2016-08-08 10:29:05] [DEBUG] commit_transaction()

If this commit is what you are seeing stuck (in the ps output above),
then it's likely that you have a combination of synchronous_commit turned on
and synchronous_standby_names listing one standby which is not available.

What settings are you using for those two parameters?

> The system is left in a strange state after this. If I start postgresql
> again in old master node and issue cluster show it lists both nodes as
> masters.

That's not a surprise. This is called split brain, something repmgr
doesn't fully take care of (we rely on other tools to do the fencing or
STONITH).

Regards,

--
Martín Marqués                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: upgrade to repmgr3

From:
Pekka Rinne
Date:
hi

2016-08-09 15:39 GMT+03:00 Martín Marqués <martin@2ndquadrant.com>:
> Do you by chance have synchronous replication set? That ps output alone
> doesn't say much, but being stuck on COMMIT normally points to failure
> to sync the replication on a standby.


Yeah, I learned that repmgr 3 actually writes to the DB during promote; repmgr 2 does not do that. And if the failed master itself is on the synchronous standbys list, then the promote command hangs at commit, as it's not possible to sync to that failed node. The solution seemed to be to temporarily comment out the synchronous standbys in postgresql.conf on the new master and reload the config file. After that the promote command returns on the command line.
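The steps above can be sketched as follows; this is an assumed sketch (on 9.4 you can either edit postgresql.conf by hand, as described, or use ALTER SYSTEM):

```sql
-- Temporarily disable synchronous replication on the new master so the
-- promote's metadata commit can complete:
ALTER SYSTEM SET synchronous_standby_names = '';
SELECT pg_reload_conf();

-- ... let the pending COMMIT / promote finish ...

-- Restore the previous setting once a synchronous standby is available again:
ALTER SYSTEM RESET synchronous_standby_names;
SELECT pg_reload_conf();
```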

br,
Pekka

Re: upgrade to repmgr3

From:
Martín Marqués
Date:
Hi,

2016-08-11 7:54 GMT-03:00 Pekka Rinne <tsierkkis@gmail.com>:
>>
>> Do you by chance have synchronous replication set? That ps output alone
>> doesn't say much, but being stuck on COMMIT normally points to failure
>> to sync the replication on a standby.
>>
>
> Yeah, I learned that repmgr3 actually writes to the DB during promote. Repmgr2
> does not do that. And if the failed master itself is on the synchronous
> standbys list then the promote command hangs in commit as it's not possible
> to sync to that failed node. Solution seemed to be to temporarily comment
> out synchronous standbys from postgresql.conf in new master and reload the
> config file. Then the promote command returns in command line.

A better solution would be to have at least two standbys listed in
synchronous_standby_names. With only one node name there, if that
standby goes down, all transactions that make changes will hang at
commit.
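On 9.4 that might look like this in postgresql.conf (the standby names are examples; the first connected standby in list order becomes the synchronous one, and the second acts as a fallback):

```
# postgresql.conf -- example names, adjust to your cluster
synchronous_standby_names = 'standby_a, standby_b'
synchronous_commit = on
```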

Regards,

--
Martín Marqués                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: upgrade to repmgr3

From:
Pekka Rinne
Date:
hi

2016-08-04 16:19 GMT+03:00 Ian Barwick <ian@2ndquadrant.com>:
> As a workaround you can manually update the slot name in the repl_nodes table
> to match the one you've chosen. We'll update repmgr to better handle this
> kind of situation. I don't think we've had this particular use-case before,
> so I'll add some notes to the documentation on how best to handle it.


I tried this workaround. But what I noticed was that after promoting a standby to master, my replication slots were renamed in repl_nodes to repmgr_slot_2 and repmgr_slot_3, names I did not use when creating them.

So is the correct procedure to assume that a node's slot name should be repmgr_slot_<ID>? This seems to be the case, and slots can be created following this rule.
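Assuming that convention holds (repmgr_slot_<node id>, inferred from the behaviour described above rather than from documentation), pre-creating the slots on the master might look like this, with node ids 2 and 3 taken as examples:

```sql
-- Hypothetical sketch: pre-create physical slots named after repmgr node ids.
SELECT * FROM pg_create_physical_replication_slot('repmgr_slot_2');
SELECT * FROM pg_create_physical_replication_slot('repmgr_slot_3');

-- Confirm they exist (inactive until a standby attaches to them):
SELECT slot_name, active FROM pg_replication_slots;
```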

Another question: is it mandatory to pre-create the slots using pg_create_physical_replication_slot()? Even if I don't do that, replication still seems to work, but the pg_replication_slots table remains empty. Only after promoting a standby to master does one slot seem to get created automatically.

br,
Pekka