Discussion: Replication streaming issue

Replication streaming issue

From
Mai Peng
Date:
Hello,

I’ve got a strange issue.
Here is the error in the PostgreSQL log:
ERROR:  requested WAL segment 000000020000A01A0000004F has already been removed
Everything else is working fine: the lag is OK.
These are the check queries:
SELECT  client_addr,
        state,
        write_lag,
        flush_lag,
        replay_lag
FROM    pg_stat_replication
WHERE   application_name = 'walreceiver';

SELECT COALESCE(ROUND(EXTRACT(epoch FROM now() - pg_last_xact_replay_timestamp())),0) AS seconds;

How can I monitor this problem? Why is streaming replication still working?

Thank you in advance.
Mai






Re: Replication streaming issue

From
Keith
Date:


On Tue, Jul 23, 2019 at 9:39 AM Mai Peng <maily.peng@webedia-group.com> wrote:
> Hello,
>
> I’ve got a strange issue.
> Here is the error in the PostgreSQL log:
> ERROR:  requested WAL segment 000000020000A01A0000004F has already been removed
> Everything else is working fine: the lag is OK.
> These are the check queries:
> SELECT  client_addr,
>         state,
>         write_lag,
>         flush_lag,
>         replay_lag
> FROM    pg_stat_replication
> WHERE   application_name = 'walreceiver';
>
> SELECT COALESCE(ROUND(EXTRACT(epoch FROM now() - pg_last_xact_replay_timestamp())),0) AS seconds;
>
> How can I monitor this problem? Why is streaming replication still working?
>
> Thank you in advance.
> Mai

Are you sure replication is actually still working? Your first query only checks whether a streaming replica is connected. Without further calculations on that information, it doesn't tell you whether the replica is actually replicating. Your second query can be misleading: if the calculation returns null, the COALESCE just reports zero. Also, if no writes are occurring on the primary, it can give false positives about the replica being behind, since pg_last_xact_replay_timestamp() only tells you the timestamp of the last transaction that was replayed. If no writes are happening, no WAL is replayed, so the reported value keeps growing. It can still be useful as a general monitoring query, but it shouldn't be your only replication monitoring method.
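
One way to make the two checks less ambiguous is sketched here. It assumes PostgreSQL 10 or later, which the write_lag, flush_lag and replay_lag columns in the original query suggest. The idea is to measure lag in bytes on the primary, and on the replica to report a time delay only when received WAL has not yet been replayed. The column aliases are illustrative.

```sql
-- On the primary: bytes of WAL each connected replica still has to replay.
-- pg_current_wal_lsn() cannot be called on a server that is in recovery.
SELECT  client_addr,
        application_name,
        state,
        pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM    pg_stat_replication;

-- On the replica: report a delay in seconds only when WAL has been received
-- but not yet replayed, and leave NULL visible instead of turning it into zero.
SELECT  pg_is_in_recovery()       AS is_replica,
        pg_last_wal_receive_lsn() AS receive_lsn,
        pg_last_wal_replay_lsn()  AS replay_lsn,
        CASE
            WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN 0
            ELSE EXTRACT(epoch FROM now() - pg_last_xact_replay_timestamp())
        END AS replay_delay_seconds;
```

A replica that has stopped streaming can still show 0 in the second query once it has replayed everything it received, so a missing row in the first query on the primary is probably the more reliable signal to alert on.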

Try creating a new object on your primary and see if it actually shows up on the replica. If it doesn't, your replica needs to be rebuilt before it can resume replication, unless you have the missing WAL files backed up somewhere else.
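
That test could look roughly like this. The table name replication_check is a placeholder for illustration, and the sketch assumes the table ends up in the public schema.

```sql
-- On the primary: create a throwaway object (hypothetical name).
CREATE TABLE replication_check (checked_at timestamptz DEFAULT now());

-- On the replica, a few seconds later: returns the table name if the change
-- was replicated, or NULL if it never arrived.
SELECT to_regclass('public.replication_check') AS replicated_table;

-- On the primary, once done: clean up.
DROP TABLE replication_check;
```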

I've written a post about better monitoring practices for replicas - https://www.keithf4.com/monitoring_streaming_slave_lag/

Keith