replication terminated by primary server

From	Bruyninckx Kristof
Subject	replication terminated by primary server
Date	December 6, 2017, 19:07:07
Msg-id	2e68e11d019147508ba6c155efdfdd35@SVINTMAIL03.cegekanv.corp.local
Responses	Re: replication terminated by primary server
List	pgsql-general

In our environment, we have a master-slave replication setup that has been running stably for the last year.

The host systems run Debian Jessie and we are using PostgreSQL 9.4.

Recently the master crashed or hung, and after restarting the postgres services on it, replication stopped working. The master itself seems to be running normally, apart from the errors reported when it was restarted. Since then, no further errors have been reported.

[10192-1] [unknown]@[unknown] LOG: incomplete startup packet

[10222-1] [unknown]@[unknown] LOG: incomplete startup packet

[10033-2] LOG: replication terminated by primary server

[10033-3] DETAIL: End of WAL reached on timeline 2 at 999/A5687790.

[1082-12] LOG: invalid record length at 999/A5687790

[10239-1] LOG: started streaming WAL from primary at 999/A5000000 on timeline 2

[1064-7] LOG: startup process (PID 1082) exited with exit code 1

[1064-8] LOG: terminating any other active server processes

[18749-1] readonly@pal WARNING: terminating connection because of crash of another server process

[25793-1] _readonly@pal WARNING: terminating connection because of crash of another server process
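The two WAL positions in the log above can be compared with a short sketch. It assumes the default 16 MB WAL segment size (a build-time setting that may differ on this system), and the helper names are made up for illustration. It shows that 999/A5000000, where streaming restarted, is the start of the segment containing 999/A5687790, where the end of WAL was reported.

```python
# Assumption: default WAL segment size of 16 MB; a custom build may differ.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024


def parse_lsn(lsn: str) -> int:
    """Convert an LSN written as 'HI/LO' (two hex numbers) to a 64-bit integer."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def format_lsn(value: int) -> str:
    """Format a 64-bit position back into the 'HI/LO' hex notation."""
    return f"{value >> 32:X}/{value & 0xFFFFFFFF:X}"


def segment_start(lsn: str, segment_size: int = WAL_SEGMENT_SIZE) -> str:
    """Return the LSN of the first byte of the segment that contains lsn."""
    value = parse_lsn(lsn)
    return format_lsn(value - value % segment_size)


if __name__ == "__main__":
    end_of_wal = "999/A5687790"      # from "End of WAL reached on timeline 2"
    stream_start = "999/A5000000"    # from "started streaming WAL from primary"
    print("segment start of end-of-WAL position:", segment_start(end_of_wal))
    print("bytes between the two positions:",
          parse_lsn(end_of_wal) - parse_lsn(stream_start))
```

Under that assumption, the standby is asking for the same segment again from its beginning rather than skipping ahead, so the "invalid record length" position is inside the segment being re-requested.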

After a recent crash of the postgres master I'm not able to get the slave to start replicating.

I always get the following error messages:

[13247-2] HINT: Future log output will go to log destination "syslog".

[13247-3] LOCATION: PostmasterMain, postmaster.c:1228

[13248-1] LOG: 00000: database system was interrupted while in recovery at log time 2017-12-04 15:10:29 CET

[13248-2] HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.

[13248-3] LOCATION: StartupXLOG, xlog.c:6134

[13248-4] LOG: 00000: entering standby mode

[13248-5] LOCATION: StartupXLOG, xlog.c:6203

[13247-4] LOG: 00000: startup process (PID 13248) exited with exit code 1

[13247-5] LOCATION: LogChildExit, postmaster.c:3452

[13247-6] LOG: 00000: aborting startup due to startup process failure

I’ve already tried a complete backup and resync procedure on the slave:

pg_basebackup -D /var/lib/postgresql/backups/fullbackup -R -h <IP> --checkpoint=fast --username=<username> --xlog-method=stream

This completes without any error message. The odd thing is that the backup folder already contains a recovery.done file. When I run the same command on a test platform, this recovery.done file is not created.

However, the test platform is running 9.5, so I am not sure whether that is related.
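A quick way to see which recovery files ended up in the backup directory is sketched below. The function name is hypothetical, and the default path is simply the one used in the pg_basebackup command above. As far as I understand it, pg_basebackup copies the master's data directory, so a recovery.done left there from an earlier promotion would presumably come along, while -R writes a fresh recovery.conf.

```python
import sys
from pathlib import Path

# The two file names a 9.4 standby setup cares about.
RECOVERY_FILES = ("recovery.conf", "recovery.done")


def recovery_files(data_dir):
    """Report which recovery files exist in data_dir (hypothetical helper)."""
    base = Path(data_dir)
    return {name: (base / name).is_file() for name in RECOVERY_FILES}


if __name__ == "__main__":
    # Default path taken from the pg_basebackup command in this post.
    default_dir = "/var/lib/postgresql/backups/fullbackup"
    target = sys.argv[1] if len(sys.argv) > 1 else default_dir
    for name, present in recovery_files(target).items():
        print(f"{name}: {'present' if present else 'missing'}")
```

Running the same check against the master's data directory would show whether recovery.done originates there.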

Also, the recovery.conf contains all the information it should, but the error message stays the same.

cat recovery.conf

recovery_target_timeline='latest'

standby_mode = 'on'

primary_conninfo = 'user=<user> password=<passwd> host=IP port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres'

Does this mean that the corruption is on the master system and that it needs to be restored to a point before it crashed? I am not sure what I can do to get replication working again.

Any ideas?

Kind Regards,

Kristof

Met vriendelijke groeten / Meilleures salutations / Best regards

Kristof Bruyninckx
System Engineer