Re: Logical replication monitoring

Поиск
Список
Период
Сортировка
От Achilleas Mantzios
Тема Re: Logical replication monitoring
Дата
Msg-id 331d46b8-8e04-6ed6-d86a-fbc6ff1a9af1@matrix.gatewaynet.com
обсуждение исходный текст
Ответ на Re: Logical replication monitoring  (AYahorau@ibagroup.eu)
Список pgsql-admin
On 23/11/18 4:31 μ.μ., AYahorau@ibagroup.eu wrote:
Hello,
Thank you again for the suggestion. I configured tail_n_mail on my SLES12 machine as follows as everything was ok:

LOG_LINE_PREFIX: '%t %e '
EMAIL: someone@example.com
MAILSUBJECT: Acme HOST Postgres errors UNIQUE : NUMBER
INCLUDE: ERROR:  terminating logical replication worker due to timeout
INCLUDE: LOG:  worker process: logical replication worker for subscription [0-9]+ \({1}PID [0-9]+\){1} exited with exit code [0-9]+
FILE: /var/lib/pgsql/pg_log/LATEST

So I configured LOG_LINE_PREFIX  equal to log_prefix_line from postgresql.conf.
So tail_n_mail was able to catch such entries as this one:
2018-11-21 12:41:32 FET 00000 LOG:  worker process: logical replication worker for subscription 16386 (PID 31777) exited with exit code 1

But after I have updated my SLES12 to SP3 tail_n_mail stopped noticing these entry.
2018-11-21 12:41:32 +03 00000 LOG:  worker process: logical replication worker for subscription 16386 (PID 31777) exited with exit code 1
The reason is that  in different output of  timezone: FET was before and  +03 is now.

Maybe (due to the upgrade) you are missing some package having to do with timezone data?



I have two questions in this regard.
I understand that currently tail_n_mail is not completely perfect detector. Is there any other robust, reliable way of logical replication monitoring?

tail_n_mail is the only one I found. In my case it doesn't support : %c (session id), I filed an issue.
Or, you could easily hack the source to fix this regexp from \\w\\w\\w\\w? into smth that matches +03 (e.g. \\+\\d\\d)  and make it work.

Is it possible to configure any particular timestamp format  in postgresql  for its logs? As far as I know it can be only turned on or off depending on %t in log_prefix_line.

Thank you in advance,
Andrei Yahorau



From:        Andrei Yahorau/IBA
To:        pgsql-admin@postgresql.org,
Cc:        Mikalai Keida/IBA@IBA
Date:        24/08/2018 11:49
Subject:        Re: Logical replication monitoring



Hello,
Thank you for the suggestion.
I increased
wal_receiver_timeout , wal_sender_timeout parameters and now this error does not occur.

I installed tail_n_mail utility, made a simple config started in debug mode.
I am constantly facing the same error:


WARNING! Skipping non-existent file "/var/lib/pgsql/pg_log/postgresql.log-2018-08-23_154034"
Too many loops (20161): bailing:

The configuration file tail_n_mail.conf is quiet standart:
EMAIL: someone@example.com
PGLOG: log
MAILSUBJECT: Acme HOST Postgres errors UNIQUE : NUMBER
INCLUDE: ERROR:
INCLUDE: FATAL:
INCLUDE: PANIC:
FILE1: /var/lib/pgsql/pg_log/postgresql.log-%Y-%m-%d_%H%M%S
LASTFILE1: /var/lib/pgsql/pg_log/postgresql.log-2018-08-23_154034

Could you please say is there anything wrong in my configuration or script usage?

Thank you,
Andrei Yahorau




From:        Andrei Yahorau/IBA
To:        pgsql-admin@postgresql.org,
Cc:        Mikalai Keida/IBA@IBA
Date:        13/08/2018 13:16
Subject:        Re: Logical replication monitoring



Hello!

Thank you for your suggestion.
I  afraid this approach is not suitable for me. As a rule my postgresql log  on subscriber side contains a  bunch of the following entries:

ERROR:  terminating logical replication worker due to timeout

00000 LOG:  worker process: logical replication worker for subscription 24578 (PID 6217) exited with exit code 1

How should I handle this situation?
As I understand this is quite normal situation. But why is severity for it  an ERROR ?

I have another assumption. Could you correct me if I am wrong.
I found out in the source code that logical replication worker termination depends on wal_receiver_timeout paramer.
So I propose setting wal_receiver_timeout  to 0.
In this case I think that monitoring of the following views  pg_stat_subscription, pg_publication and pg_stat_replication  is enough.

In case if there  is some problem with connection or with replication pg_stat_replication  will show nothing because wal sender will not be working otherwise it will give some information.
Am I right? Are there any vulnerabilities in this approach ?


Best regards,
Andrei Yahorau




From:        Andrei Yahorau/IBA
To:        pgsql-admin@postgresql.org,
Cc:        Mikalai Keida/IBA@IBA
Date:        10/08/2018 13:05
Subject:        Logical replication monitoring




Hello PostgreSQL Community!

I configured logical replication for PostgreSQL 10.4 on 2 machines, set wal_level to logical, created a publication on master node and created a subscription on standby node according to the PostgreSQL documentation.
Could you please suggest an approach for replication state monitoring.

According to my experience the monitoring of pg_stat_subscription and pg_publication, pg_replication_slots unfortunately is not enough for this aim. Moreover standby database does not prohibit write operations by default and it can lead to some inconsistency between these databases.

For example a chain of queries as
SELECT pg_is_is_recovery(),
SELECT * FROM pg_stat_replication and
SELECT * FROM pg_stat_wal_receiver
provide insight into replication state for hot_standby replication.

So is there a reliable way of replication state monitoring for logical replication?

Best regards,
Andrei Yahorau


-- 
Achilleas Mantzios
IT DEV Lead
IT DEPT
Dynacom Tankers Mgmt

В списке pgsql-admin по дате отправления:

Предыдущее
От: Tom Lane
Дата:
Сообщение: Re: User Authentication: LDAP and "local" accounts concurrently ?
Следующее
От: Stephen Frost
Дата:
Сообщение: Re: User Authentication: LDAP and "local" accounts concurrently ?