Обсуждение: PG 9.3.12: Replication appears to have worked, but getting error messages in logs
PG 9.3.12: Replication appears to have worked, but getting error messages in logs
От
David Caldwell
Дата:
Hello, We're using streaming replication. Our technique for spinning up a db slave is this: rsync from master (gross copy) pg_start_backup() on server rsync from master (correct copy) pg_stop_backup() drop in recovery.conf into slave directory enable hot_standby in slave conf start slave After starting the slave, I'm getting this error every 5 seconds in the log: ERROR: requested WAL segment 0000000100000E2200000005 has already been removed But I can connect to the DB and make queries and new records are definitely streaming in. I thought I just didn't have enough wal segments, so I bumped up the number on the master and restarted the process. It just finished the second time and the exact same error message is in the logs again (same wal segment number). When I ran pg_start_backup() and pg_stop_backup(), the output was: pg_start_backup ----------------- E27/3100A200 pg_stop_backup ---------------- E28/7D357950 The backup_label file looked like this: START WAL LOCATION: E27/3100A200 (file 0000000100000E2700000031) CHECKPOINT LOCATION: E27/31C9C740 BACKUP METHOD: pg_start_backup BACKUP FROM: master START TIME: 2016-04-02 12:34:25 PDT LABEL: clone During the rsync it copied pg_xlog/0000000100000E25000000F1 to pg_xlog/0000000100000E2800000071 So I'm confused: why is the E22 wal being requested? It seems to predate the backup by a lot. Does the slave really contain all the data? If not, how can I tell what is missing (and why is it accepting streaming data if it's missing something)? One more piece of the puzzle that may or may not be relevant: The current master used to be streaming replication slave. The original master had a disk failure and so we switched one of the backup slaves into a master. We've replaced the disk on the original server and we're now trying to make it a streaming replication slave. This is the part that's failing. If I do rough estimates of how fast the Exx number is incrementing and compute backwards, E22 seems like about the time of the original disk failure, give or take. Thanks, David
Вложения
Re: PG 9.3.12: Replication appears to have worked, but getting error messages in logs
От
Michael Paquier
Дата:
On Sun, Apr 3, 2016 at 2:50 PM, David Caldwell <david@porkrind.org> wrote: > Hello, > > We're using streaming replication. Our technique for spinning up a db > slave is this: > > rsync from master (gross copy) > pg_start_backup() on server > rsync from master (correct copy) > pg_stop_backup() > drop in recovery.conf into slave directory > enable hot_standby in slave conf > start slave > > After starting the slave, I'm getting this error every 5 seconds in the log: > > ERROR: requested WAL segment 0000000100000E2200000005 has already been > removed Are you sure that this is from the slave and not another not requesting it? If this log entry has just ERROR, it means that the WAL sender generated it. So don't you have a cascading slave connecting to this new slave and requested for this WAL segment that has already been removed? Should the error come from the WAL receiver, you would have something like that. FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 00000001000000000000001C has already been removed -- Michael
Re: PG 9.3.12: Replication appears to have worked, but getting error messages in logs
От
David Caldwell
Дата:
On 4/4/16 10:57 PM, Michael Paquier wrote: > On Sun, Apr 3, 2016 at 2:50 PM, David Caldwell <david@porkrind.org> wrote: >> Hello, >> >> We're using streaming replication. Our technique for spinning up a db >> slave is this: >> >> rsync from master (gross copy) >> pg_start_backup() on server >> rsync from master (correct copy) >> pg_stop_backup() >> drop in recovery.conf into slave directory >> enable hot_standby in slave conf >> start slave >> >> After starting the slave, I'm getting this error every 5 seconds in the log: >> >> ERROR: requested WAL segment 0000000100000E2200000005 has already been >> removed > > Are you sure that this is from the slave and not another not > requesting it? If this log entry has just ERROR, it means that the WAL > sender generated it. So don't you have a cascading slave connecting to > this new slave and requested for this WAL segment that has already > been removed? Aha! Yes! There was a secondary slave still trying to replicate from the original master (now turned into a slave). > Should the error come from the WAL receiver, you would have something like that. > FATAL: could not receive data from WAL stream: ERROR: requested WAL > segment 00000001000000000000001C has already been removed I see the difference in the error messages now and it all makes sense. Turning off that secondary slave causes the messages to stop. So it looks like everything is good. Thanks! -David