Thread: Weird error when setting up streaming replication

Weird error when setting up streaming replication

From
Quentin Hartman
Date:
I'm going through all my usual steps for setting up streaming replication on a new pair of servers: modify configs as appropriate, rsync data from master to slave, etc. I have this all automated with Chef, and it has been pretty bulletproof for a while. However, today I ran into this when starting the slave on this new pair:

 * Starting PostgreSQL 9.2 database server
 * The PostgreSQL server failed to start. Please check the log output:
2013-08-08 23:47:30 GMT LOG:  database system was interrupted; last known up at 2013-08-08 23:22:40 GMT
2013-08-08 23:47:30 GMT LOG:  entering standby mode
2013-08-08 23:47:30 GMT LOG:  WAL file is from different database system
2013-08-08 23:47:30 GMT DETAIL:  WAL file database system identifier is 5909892614333033983, pg_control database system identifier is 5909892824786287231.
2013-08-08 23:47:30 GMT LOG:  invalid primary checkpoint record
2013-08-08 23:47:30 GMT LOG:  invalid secondary checkpoint record
2013-08-08 23:47:30 GMT PANIC:  could not locate a valid checkpoint record
2013-08-08 23:47:30 GMT LOG:  startup process (PID 10600) was terminated by signal 6: Aborted
2013-08-08 23:47:30 GMT LOG:  aborting startup due to startup process failure


And I've been stumped. I've completely nuked my data dirs and started over and gotten the same result, but with different identifier numbers (as I would expect).

Any ideas?

Thanks!

QH

Re: Weird error when setting up streaming replication

From
Michael Paquier
Date:
On Fri, Aug 9, 2013 at 8:55 AM, Quentin Hartman
<qhartman@direwolfdigital.com> wrote:
> 2013-08-08 23:47:30 GMT LOG:  WAL file is from different database system
> 2013-08-08 23:47:30 GMT DETAIL:  WAL file database system identifier is
> 5909892614333033983, pg_control database system identifier is
> 5909892824786287231.
It looks like the standby cannot find a valid checkpoint record when
replaying WAL because your new system was initialized with a fresh
initdb, as the mismatched identifiers above indicate. You should build
your new node from a base backup or a snapshot of the data folder of
the node you are trying to replace.
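
To make the mismatch concrete, here is a minimal Python sketch that pulls the two identifiers out of the DETAIL line quoted above and compares them. The helper name `parse_identifiers` is made up for illustration; only the log text comes from this thread. Two different values mean the WAL files in pg_xlog and the pg_control file were produced by two separate initdb runs.

```python
import re

# Pattern for the DETAIL line that accompanies
# "WAL file is from different database system".
DETAIL_RE = re.compile(
    r"WAL file database system identifier is (\d+), "
    r"pg_control database system identifier is (\d+)"
)


def parse_identifiers(log_text):
    """Return (wal_id, control_id) from the first DETAIL line, or None."""
    # Mail clients wrap the line, so collapse all whitespace before matching.
    match = DETAIL_RE.search(" ".join(log_text.split()))
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# The DETAIL line from the log in the original post.
detail = (
    "2013-08-08 23:47:30 GMT DETAIL:  WAL file database system identifier is "
    "5909892614333033983, pg_control database system identifier is "
    "5909892824786287231."
)
wal_id, control_id = parse_identifiers(detail)
print(wal_id == control_id)  # False: the two files come from different clusters
```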
--
Michael


Re: Weird error when setting up streaming replication

From
Quentin Hartman
Date:
This pair of servers aren't replacing anything, they are new, empty servers. Before starting the slave at all, I'm copying the entire data file structure over to it via rsync. I'm doing almost exactly what is described here: http://wiki.postgresql.org/wiki/Binary_Replication_Tutorial#Binary_Replication_in_6_Steps The only difference is that I've tweaked the paths on the rsync to be appropriate to my system layout. I've even gone so far as to delete everything in the data dir except for the pg_xlog directory before syncing everything over, to make sure it wasn't caused by something not getting overwritten when it was supposed to be.



Re: Weird error when setting up streaming replication

From
Quentin Hartman
Date:
OK, figured this out. I had it start copying the pg_xlog directory as well when doing the initial sync. I realized this is also the first time I've set up replication from scratch using 9.2. All my other 9.2 pairs were set up on either 9.0 or 9.1, and have been upgraded from there with replication already in place. Previously, and still according to that article in the wiki, the pg_xlog directory was specifically excluded. Does anyone know why this behavior may have changed?




Re: Weird error when setting up streaming replication

From
Jeff Janes
Date:
On Fri, Aug 9, 2013 at 8:33 AM, Quentin Hartman
<qhartman@direwolfdigital.com> wrote:
> This pair of servers aren't replacing anything, they are new, empty servers.

That should be 'empty server', singular.

> Before starting the slave at all, I'm copying the entire data filestructure
> over to it via rsync. I'm doing almost exactly what is described here:
> http://wiki.postgresql.org/wiki/Binary_Replication_Tutorial#Binary_Replication_in_6_Steps
> The only difference is that I've tweaked the paths on the rsync to be
> appropriate to my system layout. I've even gone so far as to delete
> everything in the data dir except for the pg_xlog directory before syncing
> everything over to make sure it wasn't caused by something not getting
> overwritten when it was supposed to.

So then, you *are* replacing the slave server.  If you were not, there
would be nothing in its data dir to delete, and nothing there to get
overwritten (or not get overwritten).  Also, not deleting the pg_xlog
directory (or at least the contents of that directory) is exactly the
problem.

Cheers,

Jeff


Re: Weird error when setting up streaming replication

From
Jeff Janes
Date:
On Fri, Aug 9, 2013 at 9:54 AM, Quentin Hartman
<qhartman@direwolfdigital.com> wrote:
> OK, figured this out. I had it start copying the pg_xlog directory as well
> when doing the initial sync. I realized this is also the first time I've
> set up replication from scratch using 9.2. All my other 9.2 pairs were set up
> on either 9.0 or 9.1, and have been upgraded from there with replication
> already in place. Previously, and still according to that article in the
> wiki, the pg_xlog directory was specifically excluded.

You exclude pg_xlog in the rsync so as not to copy the master's WAL
files, because they are not needed and can cause confusion.  But you
don't want an old copy of pg_xlog from a previous cluster sitting
around either, which is the situation you had.

By including pg_xlog in the sync, what you were doing is overwriting
the old files from a previous cluster (which are toxic) with ones from
the master, which are useless, but at least not generally toxic.
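
As a rough illustration of the cleanup described above, the following Python sketch empties a leftover pg_xlog directory on the standby while keeping the directory itself in place. The function name `clear_stale_wal` and its parameters are hypothetical, and the sketch assumes the standby server is stopped and that its data directory really is meant to be overwritten by the sync that follows.

```python
import shutil
from pathlib import Path


def clear_stale_wal(data_dir, wal_dirname="pg_xlog"):
    """Delete everything inside data_dir/pg_xlog but keep the directory.

    Returns the sorted names of the entries that were removed.
    Assumption: the PostgreSQL server using data_dir is not running.
    """
    wal_dir = Path(data_dir) / wal_dirname
    removed = []
    if not wal_dir.is_dir():
        return removed  # nothing to clean up
    for entry in sorted(wal_dir.iterdir()):
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)  # subdirectories such as archive_status
        else:
            entry.unlink()  # WAL segment files left by the old cluster
        removed.append(entry.name)
    return removed
```

Keeping the empty directory matters because, as reported later in this thread, a cluster whose pg_xlog directory is missing altogether does not start.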

I think one problem from the wiki is step 3:

3. Edit recovery.conf and postgresql.conf on the standby to start up
replication and hot standby. First, in postgresql.conf, change this
line

It doesn't tell you how you got those files in the first place, in
order to edit them.  You apparently got them from an initdb.  What you
probably want to do instead is get them by copying them from the
master.

> Does anyone know why
> this behavior may have changed?

I don't think it has changed.  I think your interpretation of the
instructions has changed, so you did something different under 9.0 and
9.1.

Cheers,

Jeff


Re: Weird error when setting up streaming replication

From
pgdude
Date:
I get the same "weird" errors (WAL file is from different database system)
too with Ubuntu and PostgreSQL 9.3 when setting up a slave using rsync.

1. I installed PostgreSQL on the slave (which automatically does the
initdb):
   sudo apt-get install postgresql-9.3

2. Modified my postgresql.conf file
(/etc/postgresql/9.3/main/postgresql.conf) to make it a slave.  Did the same
thing for pg_hba.conf adding my replication user in there.

3. Stopped both master and slave.

4. Did the rsync from the master to the slave excluding pg_xlog (thereby
leaving the existing pg_xlog contents on the slave intact).

Then I get the same errors (WAL file is from different database system).

Now if I delete everything from the data directory on the slave, including
the pg_xlog directory, and then do the rsync excluding the pg_xlog
directory, the cluster won't start because the pg_xlog directory is not
there.

But if I rsync with the pg_xlog directory, then I do not get any more
messages in the log file, whether I had the installation data directory in
place, or I deleted everything from the data directory before the rsync.


So it seems that with PostgreSQL 9.3 on Ubuntu, you should NOT
exclude pg_xlog when rsyncin' the stuff over.
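
The three outcomes above (stale pg_xlog kept, pg_xlog missing, pg_xlog copied) suggest a fourth option: skip the WAL segments but still create the directory. Here is a minimal Python sketch that only builds the rsync argument list, under the assumption that rsync's filter rules treat `--exclude=/pg_xlog/*` as "skip the contents, still transfer the directory entry". The function name, the data directory path, and the host name are made up for illustration. Since rsync leaves excluded paths on the receiving side alone, the standby's own pg_xlog contents still have to be emptied separately, as Jeff describes earlier in the thread.

```python
def build_rsync_args(master_data_dir, standby_target, exclude_wal_contents=True):
    """Build an rsync argv for the initial copy of the data directory.

    Nothing is executed here; pass the result to subprocess.run() yourself.
    """
    args = ["rsync", "-a"]
    if exclude_wal_contents:
        # Anchored pattern: skip files inside pg_xlog at the top of the
        # transfer, but leave the pg_xlog directory itself in the copy.
        args.append("--exclude=/pg_xlog/*")
    # Trailing slash: copy the directory's contents, not the directory itself.
    args.append(master_data_dir.rstrip("/") + "/")
    args.append(standby_target)
    return args


# Hypothetical paths and host name, for illustration only.
print(build_rsync_args(
    "/var/lib/postgresql/9.3/main",
    "standby.example.com:/var/lib/postgresql/9.3/main/",
))
```

Passing the arguments as a list rather than a shell string keeps the `*` in the exclude pattern from being expanded by a shell before rsync sees it.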






--
View this message in context:
http://postgresql.1045698.n5.nabble.com/Weird-error-when-setting-up-streaming-replication-tp5766888p5808923.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.