Обсуждение: PG 9.3.12: Replication appears to have worked, but getting error messages in logs

Поиск
Список
Период
Сортировка

PG 9.3.12: Replication appears to have worked, but getting error messages in logs

От
David Caldwell
Дата:
Hello,

We're using streaming replication. Our technique for spinning up a db
slave is this:

rsync from master (gross copy)
pg_start_backup() on server
rsync from master (correct copy)
pg_stop_backup()
drop in recovery.conf into slave directory
enable hot_standby in slave conf
start slave

After starting the slave, I'm getting this error every 5 seconds in the log:

ERROR:  requested WAL segment 0000000100000E2200000005 has already been
removed

But I can connect to the DB and make queries and new records are
definitely streaming in.

I thought I just didn't have enough wal segments, so I bumped up the
number on the master and restarted the process. It just finished the
second time and the exact same error message is in the logs again (same
wal segment number).


When I ran pg_start_backup() and pg_stop_backup(), the output was:

 pg_start_backup
-----------------
 E27/3100A200

 pg_stop_backup
----------------
 E28/7D357950


The backup_label file looked like this:

START WAL LOCATION: E27/3100A200 (file 0000000100000E2700000031)
CHECKPOINT LOCATION: E27/31C9C740
BACKUP METHOD: pg_start_backup
BACKUP FROM: master
START TIME: 2016-04-02 12:34:25 PDT
LABEL: clone


During the rsync it copied
  pg_xlog/0000000100000E25000000F1
to
  pg_xlog/0000000100000E2800000071


So I'm confused: why is the E22 wal being requested? It seems to predate
the backup by a lot.

Does the slave really contain all the data? If not, how can I tell what
is missing (and why is it accepting streaming data if it's missing
something)?


One more piece of the puzzle that may or may not be relevant:

The current master used to be streaming replication slave. The original
master had a disk failure and so we switched one of the backup slaves
into a master. We've replaced the disk on the original server and we're
now trying to make it a streaming replication slave. This is the part
that's failing. If I do rough estimates of how fast the Exx number is
incrementing and compute backwards, E22 seems like about the time of the
original disk failure, give or take.


Thanks,
  David



Вложения

Re: PG 9.3.12: Replication appears to have worked, but getting error messages in logs

От
Michael Paquier
Дата:
On Sun, Apr 3, 2016 at 2:50 PM, David Caldwell <david@porkrind.org> wrote:
> Hello,
>
> We're using streaming replication. Our technique for spinning up a db
> slave is this:
>
> rsync from master (gross copy)
> pg_start_backup() on server
> rsync from master (correct copy)
> pg_stop_backup()
> drop in recovery.conf into slave directory
> enable hot_standby in slave conf
> start slave
>
> After starting the slave, I'm getting this error every 5 seconds in the log:
>
> ERROR:  requested WAL segment 0000000100000E2200000005 has already been
> removed

Are you sure that this is from the slave and not another not
requesting it? If this log entry has just ERROR, it means that the WAL
sender generated it. So don't you have a cascading slave connecting to
this new slave and requested for this WAL segment that has already
been removed?

Should the error come from the WAL receiver, you would have something like that.
FATAL:  could not receive data from WAL stream: ERROR:  requested WAL
segment 00000001000000000000001C has already been removed
--
Michael


Re: PG 9.3.12: Replication appears to have worked, but getting error messages in logs

От
David Caldwell
Дата:
On 4/4/16 10:57 PM, Michael Paquier wrote:
> On Sun, Apr 3, 2016 at 2:50 PM, David Caldwell <david@porkrind.org> wrote:
>> Hello,
>>
>> We're using streaming replication. Our technique for spinning up a db
>> slave is this:
>>
>> rsync from master (gross copy)
>> pg_start_backup() on server
>> rsync from master (correct copy)
>> pg_stop_backup()
>> drop in recovery.conf into slave directory
>> enable hot_standby in slave conf
>> start slave
>>
>> After starting the slave, I'm getting this error every 5 seconds in the log:
>>
>> ERROR:  requested WAL segment 0000000100000E2200000005 has already been
>> removed
>
> Are you sure that this is from the slave and not another not
> requesting it? If this log entry has just ERROR, it means that the WAL
> sender generated it. So don't you have a cascading slave connecting to
> this new slave and requested for this WAL segment that has already
> been removed?

Aha! Yes! There was a secondary slave still trying to replicate from the
original master (now turned into a slave).

> Should the error come from the WAL receiver, you would have something
like that.
> FATAL:  could not receive data from WAL stream: ERROR:  requested WAL
> segment 00000001000000000000001C has already been removed

I see the difference in the error messages now and it all makes sense.
Turning off that secondary slave causes the messages to stop. So it
looks like everything is good.

Thanks!

-David



Вложения