Обсуждение: Investigate postgres 9.6.3 repmgr lag 4.0.4

Поиск
Список
Период
Сортировка

Investigate postgres 9.6.3 repmgr lag 4.0.4

От
Mariel Cherkassky
Дата:
Hi,
During the weekend (on friday) one of my slaves suddenly stop syncing with the primary. I have 1 master + 2 standbys. When I arrived this morning to work I saw that one of the slaved has a big lag : 

postgres=# SELECT CASE WHEN pg_last_xlog_receive_location() = pg_last_xlog_replay_location() THEN 0 ELSE EXTRACT (EPOCH FROM NOW() - pg_last_xact_replay_timestamp()) END AS log_delay;
   log_delay   
---------------
 173781.573457
(1 row)


The only rows in the repmgr.log : 

2018-06-23 19:02:48] [WARNING] unable to connect to upstream node "ptkpl-psgsqldb1" (node ID: 5)
[2018-06-23 19:02:48] [NOTICE] node has recovered, reconnecting
[2018-06-23 19:02:48] [NOTICE] reconnected to upstream node after 0 seconds

Now, How can I further investigate it ? my wal_keep_segment is assigned to 100 but since friday 261 wals were generated so I guess I dont have another option but to sync the node again. However, I want to understand why it happened. What can you advice me to check ?


Moreover,  I have 2 further questions : 
1)Since friday 260 wals were generated, it doesnt sounds like too much ? I didnt have any unusual work. A new wal is generated everytime the older one is getting full or every switch right ? Is there a way to set the size of the wal ? 
2)Any recommendation regarding the wal settings ? My current settings : 
max_wal_senders=10
wal_keep_segments=100
max_wal_size =4GB
min_wal_size = 1GB
checkpoint_completion_target = 0.7


Thanks

Re: Investigate postgres 9.6.3 repmgr lag 4.0.4

От
Rui DeSousa
Дата:


On Jun 24, 2018, at 4:59 AM, Mariel Cherkassky <mariel.cherkassky@gmail.com> wrote:


Now, How can I further investigate it ? my wal_keep_segment is assigned to 100 but since friday 261 wals were generated so I guess I dont have another option but to sync the node again. However, I want to understand why it happened. What can you advice me to check ?


What is the connectivity between the nodes, any firewalls?  What’s the settings for wal_sender_timeout and wal_receiver_timeout?  Why not use a replication slot or have it fail over to using the archived WALs instead of full database restore?

There should be other messages in Postgresql logs.


Re: Investigate postgres 9.6.3 repmgr lag 4.0.4

От
Flavio Henrique Araque Gurgel
Дата:


Em dom, 24 de jun de 2018 às 17:32, Rui DeSousa <rui@crazybean.net> escreveu:


On Jun 24, 2018, at 4:59 AM, Mariel Cherkassky <mariel.cherkassky@gmail.com> wrote:


Now, How can I further investigate it ? my wal_keep_segment is assigned to 100 but since friday 261 wals were generated so I guess I dont have another option but to sync the node again. However, I want to understand why it happened. What can you advice me to check ?


What is the connectivity between the nodes, any firewalls?  What’s the settings for wal_sender_timeout and wal_receiver_timeout?  Why not use a replication slot or have it fail over to using the archived WALs instead of full database restore?

There should be other messages in Postgresql logs.


Actually there are just two simple questions for the OP: Why you're not running the latest 9.6 and why you're not using replication slots? 

Since you'll have to recreate the lost standby, I strongly recommend going for replication slots to avoid the situation of standby servers getting out of sync, you won't need to mess with wal_keep_segments anymore.

Flavio Gurgel

Re: Investigate postgres 9.6.3 repmgr lag 4.0.4

От
Ron
Дата:
On 06/24/2018 11:22 AM, Flavio Henrique Araque Gurgel wrote:
[snip]

Actually there are just two simple questions for the OP: Why you're not running the latest 9.6

Besides "he hasn't gotten around to it", there are companies that allocate a few hours of downtime once per year (usually on Christmas Day) for critical databases managing process control in factories.

--
Angular momentum makes the world go 'round.

Re: Investigate postgres 9.6.3 repmgr lag 4.0.4

От
Mariel Cherkassky
Дата:
To be honest I used most of the default settings of repmgr and didnt focus on replication slot because I didnt saw that is was mentioned in the documentation. I will schedule soon a maintenance work to upgrade the cluster to version 10. I dont have any firewalls between all the nodes. So the optimal use is configure replication slot. Another question ,  "have it fail over to using the archived WALs instead of full database restore" How do I configure this ?

Thanks , Mariel.

2018-06-24 19:37 GMT+03:00 Ron <ronljohnsonjr@gmail.com>:
On 06/24/2018 11:22 AM, Flavio Henrique Araque Gurgel wrote:
[snip]

Actually there are just two simple questions for the OP: Why you're not running the latest 9.6

Besides "he hasn't gotten around to it", there are companies that allocate a few hours of downtime once per year (usually on Christmas Day) for critical databases managing process control in factories.

--
Angular momentum makes the world go 'round.

Re: Investigate postgres 9.6.3 repmgr lag 4.0.4

От
Mariel Cherkassky
Дата:
Moreover what of the next options do you recommend using in repmgr : 
1)Using replication slots 
2)Using barman

Thanks.

2018-06-25 9:44 GMT+03:00 Mariel Cherkassky <mariel.cherkassky@gmail.com>:
To be honest I used most of the default settings of repmgr and didnt focus on replication slot because I didnt saw that is was mentioned in the documentation. I will schedule soon a maintenance work to upgrade the cluster to version 10. I dont have any firewalls between all the nodes. So the optimal use is configure replication slot. Another question ,  "have it fail over to using the archived WALs instead of full database restore" How do I configure this ?

Thanks , Mariel.

2018-06-24 19:37 GMT+03:00 Ron <ronljohnsonjr@gmail.com>:
On 06/24/2018 11:22 AM, Flavio Henrique Araque Gurgel wrote:
[snip]

Actually there are just two simple questions for the OP: Why you're not running the latest 9.6

Besides "he hasn't gotten around to it", there are companies that allocate a few hours of downtime once per year (usually on Christmas Day) for critical databases managing process control in factories.

--
Angular momentum makes the world go 'round.

Re: Investigate postgres 9.6.3 repmgr lag 4.0.4

От
Rui DeSousa
Дата:


On Jun 25, 2018, at 2:44 AM, Mariel Cherkassky <mariel.cherkassky@gmail.com> wrote:

 "have it fail over to using the archived WALs instead of full database restore" How do I configure this ?


With Postgres replication, it’s configured it in the recovery.conf file using the “restore_command”.  It would amount to a some script that connect into your backups and pulls the requested WAL file.


When you say no firewall; that is bit confusing and I’m left assuming that the nodes are on the same subnet?  I normally only use replication slots with either a backup solution or a replia that is going over a WAN.  I am bit perplex why replication would fall that far behind on a local network (send lag not replay lag).  What is the interconnect; is it gigabit or 10g and what the volume of WALs being generated? Might have a network related issue here.

I haven’t used repmgr; thus I can’t help there.

Re: Investigate postgres 9.6.3 repmgr lag 4.0.4

От
Mariel Cherkassky
Дата:
Hi all,
it happened again. The weird thing is that when I query pg_stat_replication I see only one slave(the one that is still synced) and I dont see the second one. Moreover,  I dont see anything in the repmgr log of the primary and in the slave regarding the disconnection...

2018-06-25 17:21 GMT+03:00 Rui DeSousa <rui@crazybean.net>:


On Jun 25, 2018, at 2:44 AM, Mariel Cherkassky <mariel.cherkassky@gmail.com> wrote:

 "have it fail over to using the archived WALs instead of full database restore" How do I configure this ?


With Postgres replication, it’s configured it in the recovery.conf file using the “restore_command”.  It would amount to a some script that connect into your backups and pulls the requested WAL file.


When you say no firewall; that is bit confusing and I’m left assuming that the nodes are on the same subnet?  I normally only use replication slots with either a backup solution or a replia that is going over a WAN.  I am bit perplex why replication would fall that far behind on a local network (send lag not replay lag).  What is the interconnect; is it gigabit or 10g and what the volume of WALs being generated? Might have a network related issue here.

I haven’t used repmgr; thus I can’t help there.


Re: Investigate postgres 9.6.3 repmgr lag 4.0.4

От
Rui DeSousa
Дата:

> On Jun 27, 2018, at 6:02 AM, Mariel Cherkassky <mariel.cherkassky@gmail.com> wrote:
>
> Hi all,
> it happened again. The weird thing is that when I query pg_stat_replication I see only one slave(the one that is
stillsynced) and I dont see the second one. Moreover,  I dont see anything in the repmgr log of the primary and in the
slaveregarding the disconnection... 
>

Right, once it drops out of replication it no longer shows up in pg_stat_replication — it is a live view of current
activeconnections. 

Does the replica report that the WAL has already been removed from primary?  If so most likely due to a spike in WAL
filegeneration where the setting wal_keep_segment is too low; However, I’m still confused about your setup.  Is the
replicaon the same subnet and what is the network usages? Any large activity happening on the system, i.e. index
rebuilds? How many WALs where generated when the replica dropped out? 






Re: Investigate postgres 9.6.3 repmgr lag 4.0.4

От
Jerry Sievers
Дата:
Mariel Cherkassky <mariel.cherkassky@gmail.com> writes:

> Hi all,
> it happened again. The weird thing is that when I query
> pg_stat_replication I see only one slave(the one that is still
> synced) and I dont see the second one. Moreover,  I dont see anything
> in the repmgr log of the primary and in the slave regarding the
> disconnection...

Disclaimer: I am not a repmgr admin...

I suggest you check the DB server logs on both standby node and master
for evidence of connection attempts for streaming, their success/failure
and reasons and/or what might cause an established connection to drop.

That you don't see the rogue standby in pg_stat_replication on the
master is clear evidence that it's not connected presently but this
doesn't tell the whole story.

--
Jerry Sievers
Postgres DBA/Development Consulting
e: postgres.consulting@comcast.net
p: 312.241.7800