Обсуждение: Streaming Replication Error

Поиск
Список
Период
Сортировка

Streaming Replication Error

От
Дата:

Hi Experts,

 

I have set up streaming replication on PostgreSQL 12.1

 

The master and slave are configured as below and WAL files are accumulating on the master.

 

However, something is wrong as I get complaints that the WAL files are missing after a pg_restore on MASTER

 

MASTER

 

postgres@srvm:~$ 2019-12-20 16:35:07.910 CET [1334] replicator@[unknown] ERROR:  requested WAL segment 000000010000000100000076 has already been removed

2019-12-20 16:35:12.920 CET [1338] replicator@[unknown] ERROR:  requested WAL segment 000000010000000100000076 has already been removed

2019-12-20 16:35:17.925 CET [1340] replicator@[unknown] ERROR:  requested WAL segment 000000010000000100000076 has already been removed

2019-12-20 16:35:22.932 CET [1362] replicator@[unknown] ERROR:  requested WAL segment 000000010000000100000076 has already been removed

2019-12-20 16:35:27.935 CET [1364] replicator@[unknown] ERROR:  requested WAL segment 000000010000000100000076 has already been removed

2019-12-20 16:35:32.942 CET [1365] replicator@[unknown] ERROR:  requested WAL segment 000000010000000100000076 has already been removed

2019-12-20 16:35:37.948 CET [1366] replicator@[unknown] ERROR:  requested WAL segment 000000010000000100000076 has already been removed

2019-12-20 16:35:42.954 CET [1367] replicator@[unknown] ERROR:  requested WAL segment 000000010000000100000076 has already been removed

 

 

SLAVE

 

postgres@srvs:~$ 2019-12-20 16:36:53.027 CET [21978] LOG:  started streaming WAL from primary at 1/76000000 on timeline 1

2019-12-20 16:36:53.027 CET [21978] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000010000000100000076 has already been removed

2019-12-20 16:36:58.029 CET [21979] LOG:  started streaming WAL from primary at 1/76000000 on timeline 1

2019-12-20 16:36:58.029 CET [21979] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000010000000100000076 has already been removed

2019-12-20 16:37:03.040 CET [21980] LOG:  started streaming WAL from primary at 1/76000000 on timeline 1

2019-12-20 16:37:03.040 CET [21980] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000010000000100000076 has already been removed

2019-12-20 16:37:08.042 CET [21981] LOG:  started streaming WAL from primary at 1/76000000 on timeline 1

2019-12-20 16:37:08.042 CET [21981] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000010000000100000076 has already been removed

 

 

Then the pg_basebackup is run and the slave started.

 

The slave has all the data as of the time of the backup, but no new data from the WAL files, and the error above.

 

max_wal_senders = 10         

wal_keep_segments = 120

 

What have I mis-configured? Do we need to enabled archive_mode = on for streaming replication?

Re: Streaming Replication Error

От
Jeff Janes
Дата:
On Fri, Dec 20, 2019 at 11:08 AM <soumik.bhattacharjee@kpn.com> wrote:

Hi Experts,

 

I have set up streaming replication on PostgreSQL 12.1

 

The master and slave are configured as below and WAL files are accumulating on the master.

 

However, something is wrong as I get complaints that the WAL files are missing after a pg_restore on MASTER


You ran pg_restore on the master?  Why did you do that?  Doesn't that mean you now have a new master, different from the old master?  Did you restore into a new instance, or into the existing instance?  What command line did you use?

 

SLAVE

 ...

2019-12-20 16:37:08.042 CET [21981] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000010000000100000076 has already been removed

 

 

Then the pg_basebackup is run and the slave started.


Presumably you just didn't restart the slave.  You blew away the old one entirely, and started the new one created by the pg_basebackup.

What was the command line used for pg_basebackup?
 

 

The slave has all the data as of the time of the backup, but no new data from the WAL files, and the error above.


This is confusing.  You got the above errors before you did the new pg_basebackup, or after?
 

 

max_wal_senders = 10         

wal_keep_segments = 120

 

What have I mis-configured?


Your email doesn't seem to be written in chronological order, and you didn't include the parameters for the commands you ran, so it is hard to say what you did wrong, as we don't know what you did.

It could be that your replica is seeking files from the wrong master.
 

Do we need to enabled archive_mode = on for streaming replication?


That can sometimes be useful, but it is not necessary.  You can use a replication slot to force the master to retain sufficient logs, or you can set wal_keep_segments high enough that it (probably) keeps enough on its own.

One reason I sometimes find archive_mode to be useful in a streaming setup is that I can inject a compression step.  WAL files compress very well, and you don't get that compression with a streaming connection.  So if you fall behind and have a slow network, you can catch up much faster using a compressed archive.

Cheers,

Jeff

RE: Streaming Replication Error

От
Дата:

Thanks Jeff for your valuable inputs.

 

The setup-

 

MASTER

1.      I did the below –

psql -c "ALTER SYSTEM SET listen_addresses TO '*'";

2.      Created the user needed for replication

3.      Updated the pg_hba.conf

            echo "host replication replicator IP Address/32 md5" >> /etc/postgresql/12/main/pg_hba.conf

psql -c "select pg_reload_conf()"

 

SLAVE

1.      Stopped the postgres on slave

2.      rm -rf /var/lib/postgresql/12/main/*

3.      I ran the below one in Slave once all needed setups were done

                  pg_basebackup -h IP Address -U replicator -p 5432 -D $PGDATA -Fp -Xs -P -R

4.      Started the PostgreSQL using pg_ctl on the Slave.

 

-          Taken backup using pg_dump from a different instance.

-          Did a pg_restore on MASTER (The existing instance - MASTER)

-          The SLAVE  has all the data after the restore was done in MASTER – both were in sync.

-          But after that any new objects created on MASTER was not reflected on SLAVE the errors started coming in.

 

 

Please let me know if I did something wrong above step highlighted ? Does that mean I cannot refresh the MASTER anytime which should replicate to SLAVE?

 

Thanks

 

 

From: Jeff Janes <jeff.janes@gmail.com>
Sent: vrijdag 20 december 2019 17:45
To: Bhattacharjee, Soumik <soumik.bhattacharjee@kpn.com>
Cc: pgsql-admin@lists.postgresql.org; pgsql-admin <pgsql-admin@postgresql.org>
Subject: Re: Streaming Replication Error

 

On Fri, Dec 20, 2019 at 11:08 AM <soumik.bhattacharjee@kpn.com> wrote:

Hi Experts,

 

I have set up streaming replication on PostgreSQL 12.1

 

The master and slave are configured as below and WAL files are accumulating on the master.

 

However, something is wrong as I get complaints that the WAL files are missing after a pg_restore on MASTER

 

You ran pg_restore on the master?  Why did you do that?  Doesn't that mean you now have a new master, different from the old master?  Did you restore into a new instance, or into the existing instance?  What command line did you use?

 

 

SLAVE

 ...

2019-12-20 16:37:08.042 CET [21981] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000010000000100000076 has already been removed

 

 

Then the pg_basebackup is run and the slave started.

 

Presumably you just didn't restart the slave.  You blew away the old one entirely, and started the new one created by the pg_basebackup.

 

What was the command line used for pg_basebackup?

 

 

The slave has all the data as of the time of the backup, but no new data from the WAL files, and the error above.

 

This is confusing.  You got the above errors before you did the new pg_basebackup, or after?

 

 

max_wal_senders = 10         

wal_keep_segments = 120

 

What have I mis-configured?

 

Your email doesn't seem to be written in chronological order, and you didn't include the parameters for the commands you ran, so it is hard to say what you did wrong, as we don't know what you did.

 

It could be that your replica is seeking files from the wrong master.

 

Do we need to enabled archive_mode = on for streaming replication?

 

That can sometimes be useful, but it is not necessary.  You can use a replication slot to force the master to retain sufficient logs, or you can set wal_keep_segments high enough that it (probably) keeps enough on its own.

 

One reason I sometimes find archive_mode to be useful in a streaming setup is that I can inject a compression step.  WAL files compress very well, and you don't get that compression with a streaming connection.  So if you fall behind and have a slow network, you can catch up much faster using a compressed archive.

 

Cheers,

 

Jeff

 

 

Re: Streaming Replication Error

От
Jeff Janes
Дата:
On Fri, Dec 20, 2019 at 2:12 PM <soumik.bhattacharjee@kpn.com> wrote:

-          Did a pg_restore on MASTER (The existing instance - MASTER)

-          The SLAVE  has all the data after the restore was done in MASTER – both were in sync.

How did you determine that it had all the data?  I suspect that they fell out of sync towards the end of the pg_restore, and your method just couldn't detect this fact.  At least, I can't think of anything which would cause them to lose sync exactly at the end of pg_restore.  Maybe it was just that the first checkpoint after pg_restore was finished caused the necessary WAL files to be recycled.  It could have just as easily been a checkpoint running during the pg_restore which caused the problem, but by luck it was not.
 

Please let me know if I did something wrong above step highlighted ?


I don't think you did anything objectively wrong.  You could argue that using wal_keep_segments rather than a replication slot was wrong, or you could say it was not wrong but just a calculated risk.  In this case, it seems the risk was realized.  Using a replication slot would also be a risk, the risk in that case being that the streaming to replica can't keep up, and so pg_wal fills up to capacity and crashes the master.  You have to decide what risk you would rather take.
 

Does that mean I cannot refresh the MASTER anytime which should replicate to SLAVE?


Usually the master is your production server.  Why would you be refreshing it?  Where would you be refreshing it from?  What other server exists that contains a higher level of truth than what your master production server already has?

You can use a replication slot, you can increase wal_keep_segments to a larger value (although there is no way to know with certainty ahead of time what value will be large enough), or you can just deal with the risk that your replica may occasionally lose sync and need to be recreated.  You might also be able to change the topology so your current replica and current master both stream from the higher-source-of-truth server, rather than cascading changes, first logically and then physically. There is no correct answer, you have to understand and weigh the balance of risks for yourself.

Cheers,

Jeff

RE: Streaming Replication Error

От
Дата:

Thanks Jeff very well explained.

 

Thanks and have nice weekend J

 

 

From: Jeff Janes <jeff.janes@gmail.com>
Sent: vrijdag 20 december 2019 21:38
To: Bhattacharjee, Soumik <soumik.bhattacharjee@kpn.com>
Cc: pgsql-admin@lists.postgresql.org; pgsql-admin <pgsql-admin@postgresql.org>
Subject: Re: Streaming Replication Error

 

On Fri, Dec 20, 2019 at 2:12 PM <soumik.bhattacharjee@kpn.com> wrote:

-          Did a pg_restore on MASTER (The existing instance - MASTER)

-          The SLAVE  has all the data after the restore was done in MASTER – both were in sync.

How did you determine that it had all the data?  I suspect that they fell out of sync towards the end of the pg_restore, and your method just couldn't detect this fact.  At least, I can't think of anything which would cause them to lose sync exactly at the end of pg_restore.  Maybe it was just that the first checkpoint after pg_restore was finished caused the necessary WAL files to be recycled.  It could have just as easily been a checkpoint running during the pg_restore which caused the problem, but by luck it was not.

 

 

Please let me know if I did something wrong above step highlighted ?

 

I don't think you did anything objectively wrong.  You could argue that using wal_keep_segments rather than a replication slot was wrong, or you could say it was not wrong but just a calculated risk.  In this case, it seems the risk was realized.  Using a replication slot would also be a risk, the risk in that case being that the streaming to replica can't keep up, and so pg_wal fills up to capacity and crashes the master.  You have to decide what risk you would rather take.

 

Does that mean I cannot refresh the MASTER anytime which should replicate to SLAVE?

 

Usually the master is your production server.  Why would you be refreshing it?  Where would you be refreshing it from?  What other server exists that contains a higher level of truth than what your master production server already has?

 

You can use a replication slot, you can increase wal_keep_segments to a larger value (although there is no way to know with certainty ahead of time what value will be large enough), or you can just deal with the risk that your replica may occasionally lose sync and need to be recreated.  You might also be able to change the topology so your current replica and current master both stream from the higher-source-of-truth server, rather than cascading changes, first logically and then physically. There is no correct answer, you have to understand and weigh the balance of risks for yourself.

 

Cheers,

 

Jeff