Обсуждение: Logical replication lag in seconds

Поиск
Список
Период
Сортировка

Logical replication lag in seconds

От
Klaus Darilion
Дата:
Hello!

I currently use Slony for replication and want to switch to logical
replication. With Slony I was aware of the replication lag of each slave
in seconds. This info was available regardless if a slave was online or
offline.

For my application I need to know the "delay" of all replicas, wether
they are currently active or not. I.e. a replica may be on the other
side of the world, serving queries there, but has problems to connect to
the master. Even in this case I need to know the replication status of
the replica.

With logical replication, it seems the only available data, eve in case
a replica is offline, is in the pg_replication_slots table, for example:

-[ RECORD 1 ]-------+---------------
slot_name           | test_6
plugin              | pgoutput
slot_type           | logical
datoid              | 16402
database            | mydns
temporary           | f
active              | f
active_pid          |
xmin                |
catalog_xmin        | 1116191193
restart_lsn         | 2EA/E61E7FA8
confirmed_flush_lsn | 2EA/E626F0B0

So I guess, the replication status of this replica is the
"confirmed_flush_lsn", ie: 2EA/E626F0B0

But how to I get from 2EA/E626F0B0 to a timestamp when this transaction
was added to the WAL files?

I would appreciate any hints, or other methods to get the delay in some
time format.

thanks
Klaus




Re: Logical replication lag in seconds

От
Michael Lewis
Дата:
I am very interested in this discussion. We settled a table with a single timestamp field that a script updates every minute with NOW() so that we can check the timestamp of that table on the replica, assuming the clocks are synced, then we will be able to compute the lag.

Re: Logical replication lag in seconds

От
Klaus Darilion
Дата:
Hi Michael!

Am 21.02.2020 um 21:24 schrieb Michael Lewis:
> I am very interested in this discussion. We settled a table with a 
> single timestamp field that a script updates every minute with NOW() so 
> that we can check the timestamp of that table on the replica, assuming 
> the clocks are synced, then we will be able to compute the lag.

I have a similar workaround at the moment. But it is more a hack than a 
nice solution, ie. I also have to store the last value locally to have 
the status also available if a replica is temporarily not reachable.

Hence it would be great if the information could be retrieved from 
WAL/replication internals.

regards
Klaus



Re: Logical replication lag in seconds

От
Klaus Darilion
Дата:
For the records - with a simple script I hacked a solution which is
purely based on the server.

1. Create a table to track the timestamp of an lsn:

CREATE TABLE lsn2date(
   lsn pg_lsn PRIMARY KEY,
   seen timestamp NOT NULL DEFAULT NOW()
);
CREATE ROLE replication_lag_user WITH LOGIN PASSWORD 'xxx';
GRANT ALL ON TABLE lsn2date TO replication_lag_user;


2. Create a script which populates the table:

# cat /etc/systemd/system/calculate_logical_replication_lag.service
[Unit]
Description=Start and auto restart service

[Install]
WantedBy=multi-user.target

[Service]
ExecStart=/usr/bin/php /path/to/calculate_logical_replication_lag.php
Restart=always
RestartSec=10


# cat calculate_logical_replication_lag.php
<?php

$dbuser="replication_lag_user";
$dbpass="xxx";
if (!$dbconn = pg_pconnect('host=127.0.0.1 dbname=mydb user='.$dbuser.'
password='.$dbpass)) {
        print "Sorry, database connection failed";
        exit;
}

$accuracy = 10; // in seconds

while (1) {
        $dbq = pg_query("INSERT INTO lsn2date (lsn) VALUES
(pg_current_wal_lsn())");
        if ($dbq === FALSE) {
                mylog(LOG_ERROR, "SQL query error: ".pg_last_error()."\n");
                exit(1);
        }

        $dbq = pg_query("DELETE FROM lsn2date WHERE lsn < (".
                          "SELECT lsn FROM lsn2date WHERE lsn < (".
                            "SELECT confirmed_flush_lsn FROM
pg_replication_slots ORDER BY confirmed_flush_lsn ASC LIMIT 1".
                          ") ORDER BY lsn DESC LIMIT 1".
                        ")"
        );
        if ($dbq === FALSE) {
                mylog(LOG_ERROR, "SQL query error: ".pg_last_error()."\n");
                exit(1);
        }
        sleep($accuracy);
}




3. Get the lag, using a function which compares the lsn of the
replication_slots with the lsn/timestamp in the lsn2date table:



CREATE OR REPLACE FUNCTION get_replication_lag() RETURNS TABLE
(subscriber name, lag bigint) AS
$BODY$
DECLARE
    subscriber name;
BEGIN
    FOR subscriber IN
        SELECT slot_name FROM pg_replication_slots
    LOOP
        RETURN QUERY SELECT slot_name, EXTRACT(EPOCH FROM
NOW()-seen)::bigint lag from lsn2date,pg_replication_slots WHERE
slot_name=subscriber AND lsn < confirmed_flush_lsn ORDER BY lsn DESC
LIMIT 1;
    END LOOP;
    RETURN;
END
$BODY$
LANGUAGE plpgsql;


# SELECT * FROM get_replication_lag() ;
 subscriber | lag
------------+-----
 reg_sjc1   |   0
 reg_ffm1   |   0
 reg_tst2   |   0
 reg_mia1   |   0
 reg_jbg1   |   0
 reg_ams1   |   0
 reg_syy1   |   0
 reg_wie1   |   0
 reg_hkg1   |   0
 reg_gnf1   |   0
 reg_tor1   |   0
 reg_sea1   |   0
 reg_chi1   |   0
 reg_dfw1   |   0
 reg_sgp1   |   0
 reg_lhr1   |   0



regards
Klaus