Обсуждение: initial sync of multiple streaming slaves simultaneously

Поиск

Список

Период

Сортировка

initial sync of multiple streaming slaves simultaneously

От

Mike Roest

Дата:

19 сентября 2012 г., 15:59:17

Hey Everyone,

We currently have a 9.1.5 postgres cluster running using streaming replication. We have 3 nodes right now

2 - local that are setup with pacemaker for a HA master/slave set failover cluster

1 - remote as a DR.

Currently we're syncing with the pretty standard routine

clear local datadir

pg_start_backup

sync datadir with fast-archiver (https://github.com/replicon/fast-archiver)

pg_stop_backup

start slave

We use the streaming replication with wal_keep_segments set to 1000 to get the required WAL files to the slaves.

With this procedure we can currently only sync one of the slaves at a time if we failover. As when the second machine goes to start the sync it errors out cause trying to run pg_start_backup fails.

We're looking into was to allow both the slave and the DR to sync at the same time.

The procedure I'm currently testing is

clear localdatadir

pg_start_backup

scp datadir/backuplabel

pg_stop_backup

sync datadir with fast-archiver

start slave

This seems to be working and the slave comes up correctly and streams the WAL files it needs from the backup_label that was copied during the pg_start_backup/pg_stop_backup

Is there any hidden issue with this that we haven't seen. Or does anyone have suggestions as to an alternate procedure that will allow 2 slaves to sync concurrently.

Thanks

Re: initial sync of multiple streaming slaves simultaneously

От

Lonni J Friedman

Дата:

19 сентября 2012 г., 16:03:12

On Wed, Sep 19, 2012 at 8:59 AM, Mike Roest <mike.roest@replicon.com> wrote:
> Hey Everyone,
>     We currently have a 9.1.5 postgres cluster running using streaming
> replication.  We have 3 nodes right now
>
> 2 - local that are setup with pacemaker for a HA master/slave set failover
> cluster
> 1 - remote as a DR.
>
> Currently we're syncing with the pretty standard routine
>
> clear local datadir
> pg_start_backup
> sync datadir with fast-archiver (https://github.com/replicon/fast-archiver)
> pg_stop_backup
> start slave
>
> We use the streaming replication with wal_keep_segments set to 1000 to get
> the required WAL files to the slaves.
>
> With this procedure we can currently only sync one of the slaves at a time
> if we failover.  As when the second machine goes to start the sync it errors
> out cause trying to run pg_start_backup fails.

Specifically what is the error?

Re: initial sync of multiple streaming slaves simultaneously

От

Mike Roest

Дата:

19 сентября 2012 г., 16:05:18

Specifically what is the error?

psql (9.1.5)
Type "help" for help.

postgres=# select pg_start_backup('hotbackup',true);
pg_start_backup
-----------------
61/B000020
(1 row)

postgres=# select pg_start_backup('hotbackup',true);
ERROR: a backup is already in progress
HINT: Run pg_stop_backup() and try again.
postgres=#

Re: initial sync of multiple streaming slaves simultaneously

От

Mike Roest

Дата:

19 сентября 2012 г., 16:35:23

Our sync script is setup to fail if the pg_start_backup fails as if it fails for some other reason the sync won't be valid as the backup_label file will be missing so the slave won't have the correct location to restart from.

Originally I had gone down the road of changing the sync script such that if the pg_start_backup failed and the backup_label file existed it would sync the backup_label right away so it could then do the sync. It was also setup so that if it didn't start the backup it wouldn't stop the backup. This however didn't work as if the DR starts the backup and begins it sync first, and the local slave then goes to startup and the backup is already in progress it would complete the sync faster then the DR and then try to start up. But the local slave would not come up into hot standby until the stop_backup was executed (it came up but would never switch over to allow readonly queries).

At that point I was going to change the script to basically be whoever got to the point of needing to stop the backup first would call stop backup. But the new procedure of calling start and then stop right away seems simpler (it makes the slave startup script simpler for sure).

On Wed, Sep 19, 2012 at 10:05 AM, Mike Roest <mike.roest@replicon.com> wrote:

Specifically what is the error?
psql (9.1.5)
Type "help" for help.

postgres=# select pg_start_backup('hotbackup',true);
pg_start_backup
-----------------
61/B000020
(1 row)

postgres=# select pg_start_backup('hotbackup',true);
ERROR: a backup is already in progress
HINT: Run pg_stop_backup() and try again.
postgres=#

Re: initial sync of multiple streaming slaves simultaneously

От

Mike Roest

Дата:

19 сентября 2012 г., 19:27:10

Is there any hidden issue with this that we haven't seen. Or does anyone have suggestions as to an alternate procedure that will allow 2 slaves to sync concurrently.

With some more testing I've done today I seem to have found an issue with this procedure.

When the slave starts up after the sync It reaches what it thinks is a consistent recovery point very fast based on the pg_stop_backup

eg:

(from the recover script)

2012-09-19 12:15:02: pgsql_start start

2012-09-19 12:15:31: pg_start_backup

2012-09-19 12:15:31: -----------------

2012-09-19 12:15:31: 61/30000020

2012-09-19 12:15:31: (1 row)

2012-09-19 12:15:31:

2012-09-19 12:15:32: NOTICE: pg_stop_backup complete, all required WAL segments have been archived

2012-09-19 12:15:32: pg_stop_backup

2012-09-19 12:15:32: ----------------

2012-09-19 12:15:32: 61/300000D8

2012-09-19 12:15:32: (1 row)

2012-09-19 12:15:32:

While the sync was running (but after the pg_stop_backup) I pushed a bunch of traffic against the master server. Which got me to a current xlog location of

postgres=# select pg_current_xlog_location();

pg_current_xlog_location

--------------------------

61/6834C450

(1 row)

The startup of the slave after the sync completed:

2012-09-19 12:42:49.976 MDT [18791]: [1-1] LOG: database system was interrupted; last known up at 2012-09-19 12:15:31 MDT

2012-09-19 12:42:49.976 MDT [18791]: [2-1] LOG: creating missing WAL directory "pg_xlog/archive_status"

2012-09-19 12:42:50.143 MDT [18791]: [3-1] LOG: entering standby mode

2012-09-19 12:42:50.173 MDT [18792]: [1-1] LOG: streaming replication successfully connected to primary

2012-09-19 12:42:50.487 MDT [18791]: [4-1] LOG: redo starts at 61/30000020

2012-09-19 12:42:50.495 MDT [18791]: [5-1] LOG: consistent recovery state reached at 61/31000000

2012-09-19 12:42:50.495 MDT [18767]: [2-1] LOG: database system is ready to accept read only connections

It shows the DB reached a consistent state as of 61/31000000 which is well behind the current location of the master (and the data files that were synced over to the slave). And monitoring the server showed the expected slave delay that disappeared as the slave pulled and recovered from the WAL files that go generated after the pg_stop_backup.

But based on this it looks like this procedure would end up with a indeterminate amount of time (based on how much traffic the master processed while the slave was syncing) that the slave couldn't be trusted for fail over or querying as the server is up and running but is not actually in a consistent state.

Thinking it through the more complicated script version of the 2 server recovery (where first past the post to run start_backup or stop_backup) would also have this issue (although our failover slave would always be the one running stop backup as it syncs faster so at least it would be always consistent but the DR would still have the problem)

Re: initial sync of multiple streaming slaves simultaneously

От

Lonni J Friedman

Дата:

19 сентября 2012 г., 19:34:25

Just curious, is there a reason why you can't use pg_basebackup ?

On Wed, Sep 19, 2012 at 12:27 PM, Mike Roest <mike.roest@replicon.com> wrote:
>
>> Is there any hidden issue with this that we haven't seen.  Or does anyone
>> have suggestions as to an alternate procedure that will allow 2 slaves to
>> sync concurrently.
>>
> With some more testing I've done today I seem to have found an issue with
> this procedure.
> When the slave starts up after the sync It reaches what it thinks is a
> consistent recovery point very fast based on the pg_stop_backup
>
> eg:
> (from the recover script)
> 2012-09-19 12:15:02: pgsql_start start
> 2012-09-19 12:15:31: pg_start_backup
> 2012-09-19 12:15:31: -----------------
> 2012-09-19 12:15:31: 61/30000020
> 2012-09-19 12:15:31: (1 row)
> 2012-09-19 12:15:31:
> 2012-09-19 12:15:32: NOTICE:  pg_stop_backup complete, all required WAL
> segments have been archived
> 2012-09-19 12:15:32: pg_stop_backup
> 2012-09-19 12:15:32: ----------------
> 2012-09-19 12:15:32: 61/300000D8
> 2012-09-19 12:15:32: (1 row)
> 2012-09-19 12:15:32:
>
> While the sync was running (but after the pg_stop_backup) I pushed a bunch
> of traffic against the master server.  Which got me to a current xlog
> location of
> postgres=# select pg_current_xlog_location();
>  pg_current_xlog_location
> --------------------------
>  61/6834C450
> (1 row)
>
> The startup of the slave after the sync completed:
> 2012-09-19 12:42:49.976 MDT [18791]: [1-1] LOG:  database system was
> interrupted; last known up at 2012-09-19 12:15:31 MDT
> 2012-09-19 12:42:49.976 MDT [18791]: [2-1] LOG:  creating missing WAL
> directory "pg_xlog/archive_status"
> 2012-09-19 12:42:50.143 MDT [18791]: [3-1] LOG:  entering standby mode
> 2012-09-19 12:42:50.173 MDT [18792]: [1-1] LOG:  streaming replication
> successfully connected to primary
> 2012-09-19 12:42:50.487 MDT [18791]: [4-1] LOG:  redo starts at 61/30000020
> 2012-09-19 12:42:50.495 MDT [18791]: [5-1] LOG:  consistent recovery state
> reached at 61/31000000
> 2012-09-19 12:42:50.495 MDT [18767]: [2-1] LOG:  database system is ready to
> accept read only connections
>
> It shows the DB reached a consistent state as of 61/31000000 which is well
> behind the current location of the master (and the data files that were
> synced over to the slave).  And monitoring the server showed the expected
> slave delay that disappeared as the slave pulled and recovered from the WAL
> files that go generated after the pg_stop_backup.
>
> But based on this it looks like this procedure would end up with a
> indeterminate amount of time (based on how much traffic the master processed
> while the slave was syncing) that the slave couldn't be trusted for fail
> over or querying as the server is up and running but is not actually in a
> consistent state.
>
> Thinking it through the more complicated script version of the 2 server
> recovery (where first past the post to run start_backup or stop_backup)
> would also have this issue (although our failover slave would always be the
> one running stop backup as it syncs faster so at least it would be always
> consistent but the DR would still have the problem)

Re: initial sync of multiple streaming slaves simultaneously

От

Mike Roest

Дата:

19 сентября 2012 г., 19:51:56

Performance.

On our production DB the fast-archiver transfers the datadir in about half as much time as basebackup.

And since this happens on every failover (since clearing the datadir and resyncing as if from scratch also takes about half the time as a rsync of an existing datadir)

--Mike

On Wed, Sep 19, 2012 at 1:34 PM, Lonni J Friedman <netllama@gmail.com> wrote:

Just curious, is there a reason why you can't use pg_basebackup ?

On Wed, Sep 19, 2012 at 12:27 PM, Mike Roest <mike.roest@replicon.com> wrote:
>
>> Is there any hidden issue with this that we haven't seen. Or does anyone
>> have suggestions as to an alternate procedure that will allow 2 slaves to
>> sync concurrently.
>>
> With some more testing I've done today I seem to have found an issue with
> this procedure.
> When the slave starts up after the sync It reaches what it thinks is a
> consistent recovery point very fast based on the pg_stop_backup
>
> eg:
> (from the recover script)
> 2012-09-19 12:15:02: pgsql_start start
> 2012-09-19 12:15:31: pg_start_backup
> 2012-09-19 12:15:31: -----------------
> 2012-09-19 12:15:31: 61/30000020
> 2012-09-19 12:15:31: (1 row)
> 2012-09-19 12:15:31:
> 2012-09-19 12:15:32: NOTICE: pg_stop_backup complete, all required WAL
> segments have been archived
> 2012-09-19 12:15:32: pg_stop_backup
> 2012-09-19 12:15:32: ----------------
> 2012-09-19 12:15:32: 61/300000D8
> 2012-09-19 12:15:32: (1 row)
> 2012-09-19 12:15:32:
>
> While the sync was running (but after the pg_stop_backup) I pushed a bunch
> of traffic against the master server. Which got me to a current xlog
> location of
> postgres=# select pg_current_xlog_location();
> pg_current_xlog_location
> --------------------------
> 61/6834C450
> (1 row)
>
> The startup of the slave after the sync completed:
> 2012-09-19 12:42:49.976 MDT [18791]: [1-1] LOG: database system was
> interrupted; last known up at 2012-09-19 12:15:31 MDT
> 2012-09-19 12:42:49.976 MDT [18791]: [2-1] LOG: creating missing WAL
> directory "pg_xlog/archive_status"
> 2012-09-19 12:42:50.143 MDT [18791]: [3-1] LOG: entering standby mode
> 2012-09-19 12:42:50.173 MDT [18792]: [1-1] LOG: streaming replication
> successfully connected to primary
> 2012-09-19 12:42:50.487 MDT [18791]: [4-1] LOG: redo starts at 61/30000020
> 2012-09-19 12:42:50.495 MDT [18791]: [5-1] LOG: consistent recovery state
> reached at 61/31000000
> 2012-09-19 12:42:50.495 MDT [18767]: [2-1] LOG: database system is ready to
> accept read only connections
>
> It shows the DB reached a consistent state as of 61/31000000 which is well
> behind the current location of the master (and the data files that were
> synced over to the slave). And monitoring the server showed the expected
> slave delay that disappeared as the slave pulled and recovered from the WAL
> files that go generated after the pg_stop_backup.
>
> But based on this it looks like this procedure would end up with a
> indeterminate amount of time (based on how much traffic the master processed
> while the slave was syncing) that the slave couldn't be trusted for fail
> over or querying as the server is up and running but is not actually in a
> consistent state.
>
> Thinking it through the more complicated script version of the 2 server
> recovery (where first past the post to run start_backup or stop_backup)
> would also have this issue (although our failover slave would always be the
> one running stop backup as it syncs faster so at least it would be always
> consistent but the DR would still have the problem)

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: initial sync of multiple streaming slaves simultaneously

initial sync of multiple streaming slaves simultaneously

Re: initial sync of multiple streaming slaves simultaneously

Re: initial sync of multiple streaming slaves simultaneously

Re: initial sync of multiple streaming slaves simultaneously

Re: initial sync of multiple streaming slaves simultaneously

Re: initial sync of multiple streaming slaves simultaneously

Re: initial sync of multiple streaming slaves simultaneously