Обсуждение: Streaming Replication Error
Hi Experts,
I have set up streaming replication on PostgreSQL 12.1
The master and slave are configured as below and WAL files are accumulating on the master.
However, something is wrong as I get complaints that the WAL files are missing after a pg_restore on MASTER
MASTER
postgres@srvm:~$ 2019-12-20 16:35:07.910 CET [1334] replicator@[unknown] ERROR: requested WAL segment 000000010000000100000076 has already been removed
2019-12-20 16:35:12.920 CET [1338] replicator@[unknown] ERROR: requested WAL segment 000000010000000100000076 has already been removed
2019-12-20 16:35:17.925 CET [1340] replicator@[unknown] ERROR: requested WAL segment 000000010000000100000076 has already been removed
2019-12-20 16:35:22.932 CET [1362] replicator@[unknown] ERROR: requested WAL segment 000000010000000100000076 has already been removed
2019-12-20 16:35:27.935 CET [1364] replicator@[unknown] ERROR: requested WAL segment 000000010000000100000076 has already been removed
2019-12-20 16:35:32.942 CET [1365] replicator@[unknown] ERROR: requested WAL segment 000000010000000100000076 has already been removed
2019-12-20 16:35:37.948 CET [1366] replicator@[unknown] ERROR: requested WAL segment 000000010000000100000076 has already been removed
2019-12-20 16:35:42.954 CET [1367] replicator@[unknown] ERROR: requested WAL segment 000000010000000100000076 has already been removed
SLAVE
postgres@srvs:~$ 2019-12-20 16:36:53.027 CET [21978] LOG: started streaming WAL from primary at 1/76000000 on timeline 1
2019-12-20 16:36:53.027 CET [21978] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000000100000076 has already been removed
2019-12-20 16:36:58.029 CET [21979] LOG: started streaming WAL from primary at 1/76000000 on timeline 1
2019-12-20 16:36:58.029 CET [21979] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000000100000076 has already been removed
2019-12-20 16:37:03.040 CET [21980] LOG: started streaming WAL from primary at 1/76000000 on timeline 1
2019-12-20 16:37:03.040 CET [21980] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000000100000076 has already been removed
2019-12-20 16:37:08.042 CET [21981] LOG: started streaming WAL from primary at 1/76000000 on timeline 1
2019-12-20 16:37:08.042 CET [21981] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000000100000076 has already been removed
Then the pg_basebackup is run and the slave started.
The slave has all the data as of the time of the backup, but no new data from the WAL files, and the error above.
max_wal_senders = 10
wal_keep_segments = 120
What have I mis-configured? Do we need to enabled archive_mode = on for streaming replication?
Hi Experts,
I have set up streaming replication on PostgreSQL 12.1
The master and slave are configured as below and WAL files are accumulating on the master.
However, something is wrong as I get complaints that the WAL files are missing after a pg_restore on MASTER
SLAVE
...
2019-12-20 16:37:08.042 CET [21981] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000000100000076 has already been removed
Then the pg_basebackup is run and the slave started.
The slave has all the data as of the time of the backup, but no new data from the WAL files, and the error above.
max_wal_senders = 10
wal_keep_segments = 120
What have I mis-configured?
Do we need to enabled archive_mode = on for streaming replication?
Thanks Jeff for your valuable inputs.
The setup-
MASTER
1. I did the below –
psql -c "ALTER SYSTEM SET listen_addresses TO '*'";
2. Created the user needed for replication
3. Updated the pg_hba.conf
echo "host replication replicator IP Address/32 md5" >> /etc/postgresql/12/main/pg_hba.conf
psql -c "select pg_reload_conf()"
SLAVE
1. Stopped the postgres on slave
2. rm -rf /var/lib/postgresql/12/main/*
3. I ran the below one in Slave once all needed setups were done
pg_basebackup -h IP Address -U replicator -p 5432 -D $PGDATA -Fp -Xs -P -R
4. Started the PostgreSQL using pg_ctl on the Slave.
- Taken backup using pg_dump from a different instance.
- Did a pg_restore on MASTER (The existing instance - MASTER)
- The SLAVE has all the data after the restore was done in MASTER – both were in sync.
- But after that any new objects created on MASTER was not reflected on SLAVE the errors started coming in.
Please let me know if I did something wrong above step highlighted ? Does that mean I cannot refresh the MASTER anytime which should replicate to SLAVE?
Thanks
From: Jeff Janes <jeff.janes@gmail.com>
Sent: vrijdag 20 december 2019 17:45
To: Bhattacharjee, Soumik <soumik.bhattacharjee@kpn.com>
Cc: pgsql-admin@lists.postgresql.org; pgsql-admin <pgsql-admin@postgresql.org>
Subject: Re: Streaming Replication Error
On Fri, Dec 20, 2019 at 11:08 AM <soumik.bhattacharjee@kpn.com> wrote:
Hi Experts,
I have set up streaming replication on PostgreSQL 12.1
The master and slave are configured as below and WAL files are accumulating on the master.
However, something is wrong as I get complaints that the WAL files are missing after a pg_restore on MASTER
You ran pg_restore on the master? Why did you do that? Doesn't that mean you now have a new master, different from the old master? Did you restore into a new instance, or into the existing instance? What command line did you use?
SLAVE
...
2019-12-20 16:37:08.042 CET [21981] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000000100000076 has already been removed
Then the pg_basebackup is run and the slave started.
Presumably you just didn't restart the slave. You blew away the old one entirely, and started the new one created by the pg_basebackup.
What was the command line used for pg_basebackup?
The slave has all the data as of the time of the backup, but no new data from the WAL files, and the error above.
This is confusing. You got the above errors before you did the new pg_basebackup, or after?
max_wal_senders = 10
wal_keep_segments = 120
What have I mis-configured?
Your email doesn't seem to be written in chronological order, and you didn't include the parameters for the commands you ran, so it is hard to say what you did wrong, as we don't know what you did.
It could be that your replica is seeking files from the wrong master.
Do we need to enabled archive_mode = on for streaming replication?
That can sometimes be useful, but it is not necessary. You can use a replication slot to force the master to retain sufficient logs, or you can set wal_keep_segments high enough that it (probably) keeps enough on its own.
One reason I sometimes find archive_mode to be useful in a streaming setup is that I can inject a compression step. WAL files compress very well, and you don't get that compression with a streaming connection. So if you fall behind and have a slow network, you can catch up much faster using a compressed archive.
Cheers,
Jeff
- Did a pg_restore on MASTER (The existing instance - MASTER)
- The SLAVE has all the data after the restore was done in MASTER – both were in sync.
Please let me know if I did something wrong above step highlighted ?
Does that mean I cannot refresh the MASTER anytime which should replicate to SLAVE?
Thanks Jeff very well explained.
Thanks and have nice weekend J
From: Jeff Janes <jeff.janes@gmail.com>
Sent: vrijdag 20 december 2019 21:38
To: Bhattacharjee, Soumik <soumik.bhattacharjee@kpn.com>
Cc: pgsql-admin@lists.postgresql.org; pgsql-admin <pgsql-admin@postgresql.org>
Subject: Re: Streaming Replication Error
On Fri, Dec 20, 2019 at 2:12 PM <soumik.bhattacharjee@kpn.com> wrote:
- Did a pg_restore on MASTER (The existing instance - MASTER)
- The SLAVE has all the data after the restore was done in MASTER – both were in sync.
How did you determine that it had all the data? I suspect that they fell out of sync towards the end of the pg_restore, and your method just couldn't detect this fact. At least, I can't think of anything which would cause them to lose sync exactly at the end of pg_restore. Maybe it was just that the first checkpoint after pg_restore was finished caused the necessary WAL files to be recycled. It could have just as easily been a checkpoint running during the pg_restore which caused the problem, but by luck it was not.
Please let me know if I did something wrong above step highlighted ?
I don't think you did anything objectively wrong. You could argue that using wal_keep_segments rather than a replication slot was wrong, or you could say it was not wrong but just a calculated risk. In this case, it seems the risk was realized. Using a replication slot would also be a risk, the risk in that case being that the streaming to replica can't keep up, and so pg_wal fills up to capacity and crashes the master. You have to decide what risk you would rather take.
Does that mean I cannot refresh the MASTER anytime which should replicate to SLAVE?
Usually the master is your production server. Why would you be refreshing it? Where would you be refreshing it from? What other server exists that contains a higher level of truth than what your master production server already has?
You can use a replication slot, you can increase wal_keep_segments to a larger value (although there is no way to know with certainty ahead of time what value will be large enough), or you can just deal with the risk that your replica may occasionally lose sync and need to be recreated. You might also be able to change the topology so your current replica and current master both stream from the higher-source-of-truth server, rather than cascading changes, first logically and then physically. There is no correct answer, you have to understand and weigh the balance of risks for yourself.
Cheers,
Jeff