Discussion: [ADMIN] Problems rebuilding slave using pg_basebackup
Hi
Sorry if this email was already received; I originally sent it from my own email address
but received no response from the moderator, so I assume it may have been caught in the
filter.
We are having a number of problems when we attempt to rebuild our slave from its master.
We have made about three attempts without success (using a proven set of notes).
It has been rebuilt several times over the last few months, although the time between
pg_basebackup being started and it actually copying data can be up to six minutes.
After completion, the time taken from database startup to psql availability
can also be several minutes while it processes any remaining logs.
Both machines are virtual and are hosted with a leading cloud provider.
OS: Linux CentOS 6 (6.8 Final)
pg version 9.5.4
pg WAL settings on the master database
max_wal_senders = 5
max_wal_size = 4GB
min_wal_size = 256MB
wal_block_size = 8192
wal_buffers = 1MB
wal_compression = off
wal_keep_segments = 32
wal_level = hot_standby
wal_log_hints = off
wal_receiver_status_interval = 10s
wal_receiver_timeout = 1min
wal_retrieve_retry_interval = 5s
wal_segment_size = 16MB
wal_sender_timeout = 1min
wal_sync_method = fdatasync
wal_writer_delay = 200ms
Message from pg_basebackup
[postgres@xxxxxxxxxx]$ pg_basebackup -h -IP_HIDDEN- -D /var/lib/pgsql/9.5/data -P -U postgres --xlog-method=stream
pg_basebackup: could not receive data from WAL stream: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
269061959/269164935 kB (99%), 1/1 tablespace
pg_basebackup: child process exited with error 1
Relevant error messages from master's log
Nov 7 11:52:32 o8-data1 postgres[28558]: [6-1] user=[unknown],db=[unknown],app=[unknown]client=-IP_HIDDEN- LOG: connection received: host=-IP_HIDDEN- port=41498
Nov 7 11:52:32 o8-data1 postgres[28558]: [7-1] user=postgres,db=[unknown],app=[unknown]client=-IP_HIDDEN- LOG: replication connection authorized: user=postgres
Nov 7 13:51:44 o8-data1 postgres[28558]: [8-1] user=postgres,db=[unknown],app=pg_basebackupclient=-IP_HIDDEN- LOG: could not send data to client: Broken pipe
Nov 7 13:51:44 o8-data1 postgres[28558]: [9-1] user=postgres,db=[unknown],app=pg_basebackupclient=-IP_HIDDEN- ERROR: base backup could not send data, aborting backup
Nov 7 13:51:44 o8-data1 postgres[28558]: [10-1] user=postgres,db=[unknown],app=pg_basebackupclient=-IP_HIDDEN- FATAL: connection to client lost
Nov 7 13:51:44 o8-data1 postgres[28558]: [11-1] user=postgres,db=[unknown],app=pg_basebackupclient=-IP_HIDDEN- LOG: disconnection: session time: 1:59:11.943 user=postgres database= host=-IP_HIDDEN- port=41498
Nov 7 13:54:48 o8-data1 postgres[35445]: [6-1] user=[unknown],db=[unknown],app=[unknown]client=-IP_HIDDEN- LOG: connection received: host=-IP_HIDDEN- port=44040
Nov 7 13:54:48 o8-data1 postgres[35445]: [7-1] user=postgres,db=[unknown],app=[unknown]client=-IP_HIDDEN- LOG: replication connection authorized: user=postgres
Nov 7 15:09:20 o8-data1 postgres[35445]: [8-1] user=postgres,db=[unknown],app=pg_basebackupclient=-IP_HIDDEN- LOG: could not send data to client: Broken pipe
Nov 7 15:09:20 o8-data1 postgres[35445]: [9-1] user=postgres,db=[unknown],app=pg_basebackupclient=-IP_HIDDEN- ERROR: base backup could not send data, aborting backup
Nov 7 15:09:20 o8-data1 postgres[35445]: [10-1] user=postgres,db=[unknown],app=pg_basebackupclient=-IP_HIDDEN- FATAL: connection to client lost
Nov 7 15:09:20 o8-data1 postgres[35445]: [11-1] user=postgres,db=[unknown],app=pg_basebackupclient=-IP_HIDDEN- LOG: disconnection: session time: 1:14:31.925 user=postgres database= host=-IP_HIDDEN- port=44040
Many thanks in advance
Douglas Reed
DBA
FSB Technology
On Nov 8, 2017 5:59 AM, "Douglas Reed" <douglas@fsbtech.com> wrote:
> Hi
> Sorry if this email was already received but I sent it originally from my own email address
> but received no response from the moderator so I assume that it may have got caught in the
> filter.
> We are having a number of problems when we attempt to rebuild our slave from its master
> We have made about three attempts without success (using a proven set of notes)
> It's been rebuilt several times over the last few months although the time between
> pg_basebackup being keyed and it actually copying data can be up to six minutes.
Try setting the checkpoint mode to fast in the pg_basebackup command (-c fast) so it won't wait passively for a checkpoint before beginning the base backup.
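A minimal sketch of that suggestion, reusing the options from the original command and adding -c fast. The host name primary.example.com is a placeholder for the hidden address, and the function name basebackup_cmd is made up for illustration. The function only prints the command so it can be reviewed before it is run.

```shell
#!/bin/sh
# Print a pg_basebackup command line that requests an immediate checkpoint.
# The host argument is a placeholder; the original post hides the real address.
basebackup_cmd() {
    primary_host=$1
    data_dir=${2:-/var/lib/pgsql/9.5/data}
    # -c fast asks the server for an immediate checkpoint instead of
    # waiting for the next regular (spread) checkpoint to complete.
    printf '%s\n' "pg_basebackup -h $primary_host -D $data_dir -P -U postgres --xlog-method=stream -c fast"
}

basebackup_cmd primary.example.com
```

An immediate checkpoint causes a burst of I/O on the master, so on a busy 24/7 system it may be worth trying at a quieter time first.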
> And after completion the time taken from database startup to psql availability
> can also be several minutes while it processes any remaining logs.
Depending on how busy your primary is, this is expected. Approximately what is the WAL generation rate for your database?
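One rough way to estimate the WAL generation rate is to sample the current WAL position twice and divide the difference by the elapsed time. The sketch below does only the arithmetic. The helper names lsn_to_bytes and wal_rate and the sample LSN values are made up for illustration.

```shell
#!/bin/sh
# Convert an LSN such as "16/B374D848" to an absolute byte position.
# The part before the slash is the high 32 bits, the part after is the low 32 bits.
lsn_to_bytes() {
    hi=${1%%/*}
    lo=${1##*/}
    echo $(( (0x$hi << 32) + 0x$lo ))
}

# Average WAL bytes per second between two LSNs sampled $3 seconds apart.
wal_rate() {
    start_pos=$(lsn_to_bytes "$1")
    end_pos=$(lsn_to_bytes "$2")
    echo $(( ($end_pos - $start_pos) / $3 ))
}

# Made-up sample values: 96 MB of WAL written in 60 seconds.
wal_rate 16/B0000000 16/B6000000 60
```

On 9.5 the two samples could be taken on the master with `psql -At -c "SELECT pg_current_xlog_location()"`. The server can also compute the difference directly with pg_xlog_location_diff().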
> Both machines are virtuals and are based with a leading cloud provider
Have you checked performance metrics like I/O, CPU load, etc.? Usually you will be able to view some basic metrics out of the box.
> OS Linux Centos6 (6.8 Final)
> pg version 9.5.4
Quite a few pg_basebackup bugs were fixed in the later minor versions, especially 9.5.6:
Fix pg_basebackup's rate limiting in the presence of slow I/O (Antonin Houska)
Fix possible pg_basebackup failure on standby server when including WAL files (Amit Kapila, Robert Haas)
I always recommend keeping the minor version up to date (9.5.9 is the latest), since it needs only a quick restart of the database. I won't be surprised if this alone fixes your issue.
Try increasing wal_sender_timeout before running pg_basebackup.
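A minimal sketch of one way to raise the timeout without a restart, assuming superuser access on the master. The function name raise_wal_sender_timeout_sql is made up for illustration, and the 5min default is only an example value. The function prints the SQL so it can be reviewed and then piped to psql.

```shell
#!/bin/sh
# Emit SQL that raises wal_sender_timeout and reloads the configuration.
# ALTER SYSTEM writes the setting to postgresql.auto.conf; a reload is
# enough for this parameter, so no restart is required.
raise_wal_sender_timeout_sql() {
    timeout=${1:-5min}
    printf "ALTER SYSTEM SET wal_sender_timeout = '%s';\n" "$timeout"
    printf 'SELECT pg_reload_conf();\n'
}

raise_wal_sender_timeout_sql 5min
```

Once reviewed, the output could be applied with `raise_wal_sender_timeout_sql 5min | psql -U postgres`. Editing postgresql.conf and reloading the server has the same effect.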
Also, if you are sending/storing WAL files anywhere besides the master, once your pg_basebackup command fails, try copying the missing files manually to the path given in the restore_command parameter in the secondary's recovery.conf.
A --slot option was added to pg_basebackup in 9.6, so a command using -X stream can connect to the replication slot used by the secondary on the master, making sure no WAL files go missing.
Douglas Reed wrote:
> We are having a number of problems when we attempt to rebuild our slave from its master
>
> We have made about three attempts without success (using a proven set of notes)
>
> It's been rebuilt several times over the last few months although the time between
> pg_basebackup being keyed and it actually copying data can be up to six minutes.
> And after completion the time taken from database startup to psql availability
> can also be several minutes while it processes any remaining logs.
>
> Both machines are virtuals and are based with a leading cloud provider
>
> OS Linux Centos6 (6.8 Final)
>
> pg version 9.5.4
[...]
> Message from pg_basebackup
>
> [postgres@xxxxxxxxxx]$ pg_basebackup -h -IP_HIDDEN- -D /var/lib/pgsql/9.5/data -P -U postgres --xlog-method=stream
> pg_basebackup: could not receive data from WAL stream: server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
> 269061959/269164935 kB (99%), 1/1 tablespace
> pg_basebackup: child process exited with error 1
>
>
> Relevant error messages from master's log
>
> Nov 7 11:52:32 o8-data1 postgres[28558]: [6-1] user=[unknown],db=[unknown],app=[unknown]client=-IP_HIDDEN- LOG: connection received: host=-IP_HIDDEN- port=41498
> Nov 7 11:52:32 o8-data1 postgres[28558]: [7-1] user=postgres,db=[unknown],app=[unknown]client=-IP_HIDDEN- LOG: replication connection authorized: user=postgres
> Nov 7 13:51:44 o8-data1 postgres[28558]: [8-1] user=postgres,db=[unknown],app=pg_basebackupclient=-IP_HIDDEN- LOG: could not send data to client: Broken pipe
> Nov 7 13:51:44 o8-data1 postgres[28558]: [9-1] user=postgres,db=[unknown],app=pg_basebackupclient=-IP_HIDDEN- ERROR: base backup could not send data, aborting backup
> Nov 7 13:51:44 o8-data1 postgres[28558]: [10-1] user=postgres,db=[unknown],app=pg_basebackupclient=-IP_HIDDEN- FATAL: connection to client lost
> Nov 7 13:51:44 o8-data1 postgres[28558]: [11-1] user=postgres,db=[unknown],app=pg_basebackupclient=-IP_HIDDEN- LOG: disconnection: session time: 1:59:11.943 user=postgres database= host=-IP_HIDDEN- port=41498
>
> Nov 7 13:54:48 o8-data1 postgres[35445]: [6-1] user=[unknown],db=[unknown],app=[unknown]client=-IP_HIDDEN- LOG: connection received: host=-IP_HIDDEN- port=44040
> Nov 7 13:54:48 o8-data1 postgres[35445]: [7-1] user=postgres,db=[unknown],app=[unknown]client=-IP_HIDDEN- LOG: replication connection authorized: user=postgres
> Nov 7 15:09:20 o8-data1 postgres[35445]: [8-1] user=postgres,db=[unknown],app=pg_basebackupclient=-IP_HIDDEN- LOG: could not send data to client: Broken pipe
> Nov 7 15:09:20 o8-data1 postgres[35445]: [9-1] user=postgres,db=[unknown],app=pg_basebackupclient=-IP_HIDDEN- ERROR: base backup could not send data, aborting backup
> Nov 7 15:09:20 o8-data1 postgres[35445]: [10-1] user=postgres,db=[unknown],app=pg_basebackupclient=-IP_HIDDEN- FATAL: connection to client lost
> Nov 7 15:09:20 o8-data1 postgres[35445]: [11-1] user=postgres,db=[unknown],app=pg_basebackupclient=-IP_HIDDEN- LOG: disconnection: session time: 1:14:31.925 user=postgres database= host=-IP_HIDDEN- port=44040

Both client and server complain that the peer has gone away.
That means that the network connection got interrupted.

Either you do not have a reliable TCP connection, or some in-between
firewall terminates the connection.

Yours,
Laurenz Albe
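If an in-between firewall is dropping connections it considers idle, client-side TCP keepalives may be worth a try. The sketch below builds a libpq connection string with the standard keepalive parameters and prints a pg_basebackup command that uses it through -d. The host name, the function name keepalive_connstr and the timing values are illustrative assumptions, not tuned recommendations, and keepalives will not help with a genuinely unreliable link.

```shell
#!/bin/sh
# Build a libpq connection string with client-side TCP keepalives enabled.
# keepalives_idle:     seconds of inactivity before the first probe is sent
# keepalives_interval: seconds between unanswered probes
# keepalives_count:    unanswered probes before the connection is considered dead
keepalive_connstr() {
    host=$1
    idle=${2:-60}
    printf 'host=%s user=postgres keepalives=1 keepalives_idle=%s keepalives_interval=10 keepalives_count=5\n' "$host" "$idle"
}

# primary.example.com is a placeholder for the hidden master address.
connstr=$(keepalive_connstr primary.example.com 60)
printf 'pg_basebackup -d "%s" -D /var/lib/pgsql/9.5/data -P --xlog-method=stream\n' "$connstr"
```

The server has matching settings (tcp_keepalives_idle, tcp_keepalives_interval, tcp_keepalives_count) that can be adjusted on the master if the client side alone makes no difference.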
Many thanks for your response.
Upgrading may not be possible at present as the master is a 24/7 operation, but it would be helpful if anyone knows whether the master and slave can run different minor releases (9.5.4 vs 9.5.6).
> What is your archive_command and full_page_writes set to? Also, what is the value of checkpoint_segments and checkpoint_timeout?
checkpoint_segments - not found in the parameters; could this be the following instead?
wal_keep_segments = 48
checkpoint_timeout = 5min
archive_command = disabled
full_page_writes = on
> Try increasing wal_sender_timeout before running pg_basebackup.
I have just increased the value from 1min to 5min.
I'm also looking at other backup options such as Barman.
Regards
Douglas Reed
DBA
FSB Technology
07973-132664