Обсуждение: Logical replication monitoring

Поиск

Список

Период

Сортировка

Logical replication monitoring

От

AYahorau@ibagroup.eu

Дата:

10 августа 2018 г., 13:05:06

Hello PostgreSQL Community!

I configured logical replication for PostgreSQL 10.4 on 2 machines, set wal_level to logical, created a publication on master node and created a subscription on standby node according to the PostgreSQL documentation.
Could you please suggest an approach for replication state monitoring.

According to my experience the monitoring of pg_stat_subscription and pg_publication, pg_replication_slots unfortunately is not enough for this aim. Moreover standby database does not prohibit write operations by default and it can lead to some inconsistency between these databases.

For example a chain of queries as
SELECT pg_is_is_recovery(),
SELECT * FROM pg_stat_replication and
SELECT * FROM pg_stat_wal_receiver
provide insight into replication state for hot_standby replication.

So is there a reliable way of replication state monitoring for logical replication?

Best regards,
Andrei Yahorau

Re: Logical replication monitoring

От

Achilleas Mantzios

Дата:

10 августа 2018 г., 16:28:10

On 10/08/2018 13:05, AYahorau@ibagroup.eu wrote:

Hello PostgreSQL Community!

I configured logical replication for PostgreSQL 10.4 on 2 machines, set wal_level to logical, created a publication on master node and created a subscription on standby node according to the PostgreSQL documentation.
Could you please suggest an approach for replication state monitoring.

According to my experience the monitoring of pg_stat_subscription and pg_publication, pg_replication_slots unfortunately is not enough for this aim. Moreover standby database does not prohibit write operations by default and it can lead to some inconsistency between these databases.

For example a chain of queries as
SELECT pg_is_is_recovery(),
SELECT * FROM pg_stat_replication and
SELECT * FROM pg_stat_wal_receiver
provide insight into replication state for hot_standby replication.

So is there a reliable way of replication state monitoring for logical replication?

tail_n_mail and look for relevant errors (logical, logical replication worker ... exited with ..., ERROR)

Best regards,
Andrei Yahorau

-- 
Achilleas Mantzios
IT DEV Lead
IT DEPT
Dynacom Tankers Mgmt

Re: Logical replication monitoring

От

AYahorau@ibagroup.eu

Дата:

13 августа 2018 г., 13:16:50

Hello!

Thank you for your suggestion.
I afraid this approach is not suitable for me. As a rule my postgresql log on subscriber side contains a bunch of the following entries:

ERROR: terminating logical replication worker due to timeout
00000 LOG: worker process: logical replication worker for subscription 24578 (PID 6217) exited with exit code 1

How should I handle this situation?
As I understand this is quite normal situation. But why is severity for it an ERROR ?

I have another assumption. Could you correct me if I am wrong.
I found out in the source code that logical replication worker termination depends on wal_receiver_timeout paramer.
So I propose setting wal_receiver_timeout to 0.
In this case I think that monitoring of the following views pg_stat_subscription, pg_publication and pg_stat_replication is enough.
In case if there is some problem with connection or with replication pg_stat_replication will show nothing because wal sender will not be working otherwise it will give some information.
Am I right? Are there any vulnerabilities in this approach ?

Best regards,
Andrei Yahorau

From: Andrei Yahorau/IBA
To: pgsql-admin@postgresql.org,
Cc: Mikalai Keida/IBA@IBA
Date: 10/08/2018 13:05
Subject: Logical replication monitoring

Re: Logical replication monitoring

От

Rajni Baliyan

Дата:

14 августа 2018 г., 06:24:11

Check below link. Might help you.

https://www.postgresql.org/docs/10/static/replication-origins.html

On Mon, Aug 13, 2018 at 8:16 PM, <AYahorau@ibagroup.eu> wrote:

Hello!

Thank you for your suggestion.
I afraid this approach is not suitable for me. As a rule my postgresql log on subscriber side contains a bunch of the following entries:

ERROR: terminating logical replication worker due to timeout
00000 LOG: worker process: logical replication worker for subscription 24578 (PID 6217) exited with exit code 1

How should I handle this situation?
As I understand this is quite normal situation. But why is severity for it an ERROR ?

I have another assumption. Could you correct me if I am wrong.
I found out in the source code that logical replication worker termination depends on wal_receiver_timeout paramer.
So I propose setting wal_receiver_timeout to 0.
In this case I think that monitoring of the following views pg_stat_subscription, pg_publication and pg_stat_replication is enough.
In case if there is some problem with connection or with replication pg_stat_replication will show nothing because wal sender will not be working otherwise it will give some information.
Am I right? Are there any vulnerabilities in this approach ?

Best regards,
Andrei Yahorau

From: Andrei Yahorau/IBA
To: pgsql-admin@postgresql.org,
Cc: Mikalai Keida/IBA@IBA
Date: 10/08/2018 13:05
Subject: Logical replication monitoring

Hello PostgreSQL Community!

I configured logical replication for PostgreSQL 10.4 on 2 machines, set wal_level to logical, created a publication on master node and created a subscription on standby node according to the PostgreSQL documentation.
Could you please suggest an approach for replication state monitoring.

According to my experience the monitoring of pg_stat_subscription and pg_publication, pg_replication_slots unfortunately is not enough for this aim. Moreover standby database does not prohibit write operations by default and it can lead to some inconsistency between these databases.

For example a chain of queries as
SELECT pg_is_is_recovery(),
SELECT * FROM pg_stat_replication and
SELECT * FROM pg_stat_wal_receiver
provide insight into replication state for hot_standby replication.

So is there a reliable way of replication state monitoring for logical replication?

Best regards,
Andrei Yahorau

Re: Logical replication monitoring

От

AYahorau@ibagroup.eu

Дата:

14 августа 2018 г., 16:50:08

Hello,
Sorry for being persistent,

Could you please explain how to use these commands? Could you please provide an example?

Best regards,
Andrei Yahorau

From: Rajni Baliyan <saan654@gmail.com>
To: AYahorau@ibagroup.eu,
Cc: pgsql-admin <pgsql-admin@postgresql.org>, MikalaiKeida@ibagroup.eu
Date: 14/08/2018 06:24
Subject: Re: Logical replication monitoring

Check below link. Might help you.

https://www.postgresql.org/docs/10/static/replication-origins.html

On Mon, Aug 13, 2018 at 8:16 PM, <AYahorau@ibagroup.eu> wrote:
Hello!

Thank you for your suggestion.
I afraid this approach is not suitable for me. As a rule my postgresql log on subscriber side contains a bunch of the following entries:

ERROR: terminating logical replication worker due to timeout
00000 LOG: worker process: logical replication worker for subscription 24578 (PID 6217) exited with exit code 1

How should I handle this situation?
As I understand this is quite normal situation. But why is severity for it an ERROR ?

I have another assumption. Could you correct me if I am wrong.
I found out in the source code that logical replication worker termination depends on wal_receiver_timeout paramer.
So I propose setting wal_receiver_timeout to 0.
In this case I think that monitoring of the following views pg_stat_subscription, pg_publication and pg_stat_replication is enough.
In case if there is some problem with connection or with replication pg_stat_replication will show nothing because wal sender will not be working otherwise it will give some information.
Am I right? Are there any vulnerabilities in this approach ?

Best regards,
Andrei Yahorau

From: Andrei Yahorau/IBA
To: pgsql-admin@postgresql.org,
Cc: Mikalai Keida/IBA@IBA
Date: 10/08/2018 13:05
Subject: Logical replication monitoring

Re: Logical replication monitoring

От

Achilleas Mantzios

Дата:

20 августа 2018 г., 09:17:02

On 13/08/2018 13:16, AYahorau@ibagroup.eu wrote:

Hello!

Thank you for your suggestion.
I afraid this approach is not suitable for me. As a rule my postgresql log on subscriber side contains a bunch of the following entries:

ERROR: terminating logical replication worker due to timeout
00000 LOG: worker process: logical replication worker for subscription 24578 (PID 6217) exited with exit code 1

How should I handle this situation?
As I understand this is quite normal situation. But why is severity for it an ERROR ?

In my short experience with logical replication, those ERRORs should be avoided, you better increase the relevant timeouts: wal_receiver_timeout , wal_sender_timeout . One hand tool we use here is tail_n_mail , it is tailored specifically for parsing PostgreSQL logs and sending emails with alerts.

I have another assumption. Could you correct me if I am wrong.
I found out in the source code that logical replication worker termination depends on wal_receiver_timeout paramer.
So I propose setting wal_receiver_timeout to 0.
In this case I think that monitoring of the following views pg_stat_subscription, pg_publication and pg_stat_replication is enough.
In case if there is some problem with connection or with replication pg_stat_replication will show nothing because wal sender will not be working otherwise it will give some information.
Am I right? Are there any vulnerabilities in this approach ?

Best regards,
Andrei Yahorau

From: Andrei Yahorau/IBA
To: pgsql-admin@postgresql.org,
Cc: Mikalai Keida/IBA@IBA
Date: 10/08/2018 13:05
Subject: Logical replication monitoring

Hello PostgreSQL Community!

I configured logical replication for PostgreSQL 10.4 on 2 machines, set wal_level to logical, created a publication on master node and created a subscription on standby node according to the PostgreSQL documentation.
Could you please suggest an approach for replication state monitoring.

According to my experience the monitoring of pg_stat_subscription and pg_publication, pg_replication_slots unfortunately is not enough for this aim. Moreover standby database does not prohibit write operations by default and it can lead to some inconsistency between these databases.

For example a chain of queries as
SELECT pg_is_is_recovery(),
SELECT * FROM pg_stat_replication and
SELECT * FROM pg_stat_wal_receiver
provide insight into replication state for hot_standby replication.

So is there a reliable way of replication state monitoring for logical replication?

Best regards,
Andrei Yahorau

-- 
Achilleas Mantzios
IT DEV Lead
IT DEPT
Dynacom Tankers Mgmt

Re: Logical replication monitoring

От

AYahorau@ibagroup.eu

Дата:

24 августа 2018 г., 11:49:39

Hello,
Thank you for the suggestion.
I increased wal_receiver_timeout , wal_sender_timeout parameters and now this error does not occur.

I installed tail_n_mail utility, made a simple config started in debug mode.
I am constantly facing the same error:

WARNING! Skipping non-existent file "/var/lib/pgsql/pg_log/postgresql.log-2018-08-23_154034"
Too many loops (20161): bailing:

The configuration file tail_n_mail.conf is quiet standart:
EMAIL: someone@example.com
PGLOG: log
MAILSUBJECT: Acme HOST Postgres errors UNIQUE : NUMBER
INCLUDE: ERROR:
INCLUDE: FATAL:
INCLUDE: PANIC:
FILE1: /var/lib/pgsql/pg_log/postgresql.log-%Y-%m-%d_%H%M%S
LASTFILE1: /var/lib/pgsql/pg_log/postgresql.log-2018-08-23_154034

Could you please say is there anything wrong in my configuration or script usage?

Thank you,
Andrei Yahorau

From: Andrei Yahorau/IBA
To: pgsql-admin@postgresql.org,
Cc: Mikalai Keida/IBA@IBA
Date: 13/08/2018 13:16
Subject: Re: Logical replication monitoring

Re: Logical replication monitoring

От

Achilleas Mantzios

Дата:

24 августа 2018 г., 12:18:26

On 24/08/2018 11:49, AYahorau@ibagroup.eu wrote:

Hello,

Hello

Thank you for the suggestion.
I increased wal_receiver_timeout , wal_sender_timeout parameters and now this error does not occur.

Glad it worked!

I installed tail_n_mail utility, made a simple config started in debug mode.
I am constantly facing the same error:

I had my PITA moments with tail_n_mail myself, it seems it doesn't want to work with %c (session id) in the log_line_prefix. Anyways, some comments :
- are you doing anything exotic with your log files? which such file patterns? why do you need to create a new file every second? This is bad for many reasons.
- did you specify the same LOG_LINE_PREFIX in your tnm conf file as in postgresql.conf ?

In any case, for tnm related problems you better ask the ppl you maintain it. For the time being just make sure that your replication works.

WARNING! Skipping non-existent file "/var/lib/pgsql/pg_log/postgresql.log-2018-08-23_154034"
Too many loops (20161): bailing:

The configuration file tail_n_mail.conf is quiet standart:
EMAIL: someone@example.com
PGLOG: log
MAILSUBJECT: Acme HOST Postgres errors UNIQUE : NUMBER
INCLUDE: ERROR:
INCLUDE: FATAL:
INCLUDE: PANIC:
FILE1: /var/lib/pgsql/pg_log/postgresql.log-%Y-%m-%d_%H%M%S
LASTFILE1: /var/lib/pgsql/pg_log/postgresql.log-2018-08-23_154034

Could you please say is there anything wrong in my configuration or script usage?

Thank you,
Andrei Yahorau

From: Andrei Yahorau/IBA
To: pgsql-admin@postgresql.org,
Cc: Mikalai Keida/IBA@IBA
Date: 13/08/2018 13:16
Subject: Re: Logical replication monitoring

Hello!

Thank you for your suggestion.
I afraid this approach is not suitable for me. As a rule my postgresql log on subscriber side contains a bunch of the following entries:

ERROR: terminating logical replication worker due to timeout
00000 LOG: worker process: logical replication worker for subscription 24578 (PID 6217) exited with exit code 1

How should I handle this situation?
As I understand this is quite normal situation. But why is severity for it an ERROR ?

I have another assumption. Could you correct me if I am wrong.
I found out in the source code that logical replication worker termination depends on wal_receiver_timeout paramer.
So I propose setting wal_receiver_timeout to 0.
In this case I think that monitoring of the following views pg_stat_subscription, pg_publication and pg_stat_replication is enough.
In case if there is some problem with connection or with replication pg_stat_replication will show nothing because wal sender will not be working otherwise it will give some information.
Am I right? Are there any vulnerabilities in this approach ?

Best regards,
Andrei Yahorau

From: Andrei Yahorau/IBA
To: pgsql-admin@postgresql.org,
Cc: Mikalai Keida/IBA@IBA
Date: 10/08/2018 13:05
Subject: Logical replication monitoring

Hello PostgreSQL Community!

I configured logical replication for PostgreSQL 10.4 on 2 machines, set wal_level to logical, created a publication on master node and created a subscription on standby node according to the PostgreSQL documentation.
Could you please suggest an approach for replication state monitoring.

According to my experience the monitoring of pg_stat_subscription and pg_publication, pg_replication_slots unfortunately is not enough for this aim. Moreover standby database does not prohibit write operations by default and it can lead to some inconsistency between these databases.

For example a chain of queries as
SELECT pg_is_is_recovery(),
SELECT * FROM pg_stat_replication and
SELECT * FROM pg_stat_wal_receiver
provide insight into replication state for hot_standby replication.

So is there a reliable way of replication state monitoring for logical replication?

Best regards,
Andrei Yahorau

-- 
Achilleas Mantzios
IT DEV Lead
IT DEPT
Dynacom Tankers Mgmt

Re: Logical replication monitoring

От

AYahorau@ibagroup.eu

Дата:

23 ноября 2018 г., 14:31:18

Hello,
Thank you again for the suggestion. I configured tail_n_mail on my SLES12 machine as follows as everything was ok:

LOG_LINE_PREFIX: '%t %e '
EMAIL: someone@example.com
MAILSUBJECT: Acme HOST Postgres errors UNIQUE : NUMBER
INCLUDE: ERROR: terminating logical replication worker due to timeout
INCLUDE: LOG: worker process: logical replication worker for subscription [0-9]+ \({1}PID [0-9]+\){1} exited with exit code [0-9]+
FILE: /var/lib/pgsql/pg_log/LATEST

So I configured LOG_LINE_PREFIX equal to log_prefix_line from postgresql.conf.
So tail_n_mail was able to catch such entries as this one:
2018-11-21 12:41:32 FET 00000 LOG: worker process: logical replication worker for subscription 16386 (PID 31777) exited with exit code 1

But after I have updated my SLES12 to SP3 tail_n_mail stopped noticing these entry.
2018-11-21 12:41:32 +03 00000 LOG: worker process: logical replication worker for subscription 16386 (PID 31777) exited with exit code 1
The reason is that in different output of timezone: FET was before and +03 is now.

I have two questions in this regard.
I understand that currently tail_n_mail is not completely perfect detector. Is there any other robust, reliable way of logical replication monitoring?
Is it possible to configure any particular timestamp format in postgresql for its logs? As far as I know it can be only turned on or off depending on %t in log_prefix_line.

Thank you in advance,
Andrei Yahorau

From: Andrei Yahorau/IBA
To: pgsql-admin@postgresql.org,
Cc: Mikalai Keida/IBA@IBA
Date: 24/08/2018 11:49
Subject: Re: Logical replication monitoring

Re: Logical replication monitoring

От

Achilleas Mantzios

Дата:

23 ноября 2018 г., 14:56:55

On 23/11/18 4:31 μ.μ., AYahorau@ibagroup.eu wrote:

Hello,
Thank you again for the suggestion. I configured tail_n_mail on my SLES12 machine as follows as everything was ok:

LOG_LINE_PREFIX: '%t %e '
EMAIL: someone@example.com
MAILSUBJECT: Acme HOST Postgres errors UNIQUE : NUMBER
INCLUDE: ERROR: terminating logical replication worker due to timeout
INCLUDE: LOG: worker process: logical replication worker for subscription [0-9]+ \({1}PID [0-9]+\){1} exited with exit code [0-9]+
FILE: /var/lib/pgsql/pg_log/LATEST

So I configured LOG_LINE_PREFIX equal to log_prefix_line from postgresql.conf.
So tail_n_mail was able to catch such entries as this one:
2018-11-21 12:41:32 FET 00000 LOG: worker process: logical replication worker for subscription 16386 (PID 31777) exited with exit code 1

But after I have updated my SLES12 to SP3 tail_n_mail stopped noticing these entry.
2018-11-21 12:41:32 +03 00000 LOG: worker process: logical replication worker for subscription 16386 (PID 31777) exited with exit code 1
The reason is that in different output of timezone: FET was before and +03 is now.

Maybe (due to the upgrade) you are missing some package having to do with timezone data?

I have two questions in this regard.
I understand that currently tail_n_mail is not completely perfect detector. Is there any other robust, reliable way of logical replication monitoring?

tail_n_mail is the only one I found. In my case it doesn't support : %c (session id), I filed an issue.
Or, you could easily hack the source to fix this regexp from \\w\\w\\w\\w? into smth that matches +03 (e.g. \\+\\d\\d) and make it work.

Is it possible to configure any particular timestamp format in postgresql for its logs? As far as I know it can be only turned on or off depending on %t in log_prefix_line.

Thank you in advance,
Andrei Yahorau

From: Andrei Yahorau/IBA
To: pgsql-admin@postgresql.org,
Cc: Mikalai Keida/IBA@IBA
Date: 24/08/2018 11:49
Subject: Re: Logical replication monitoring

Hello,
Thank you for the suggestion.
I increased wal_receiver_timeout , wal_sender_timeout parameters and now this error does not occur.

I installed tail_n_mail utility, made a simple config started in debug mode.
I am constantly facing the same error:

WARNING! Skipping non-existent file "/var/lib/pgsql/pg_log/postgresql.log-2018-08-23_154034"
Too many loops (20161): bailing:

The configuration file tail_n_mail.conf is quiet standart:
EMAIL: someone@example.com
PGLOG: log
MAILSUBJECT: Acme HOST Postgres errors UNIQUE : NUMBER
INCLUDE: ERROR:
INCLUDE: FATAL:
INCLUDE: PANIC:
FILE1: /var/lib/pgsql/pg_log/postgresql.log-%Y-%m-%d_%H%M%S
LASTFILE1: /var/lib/pgsql/pg_log/postgresql.log-2018-08-23_154034

Could you please say is there anything wrong in my configuration or script usage?

Thank you,
Andrei Yahorau

From: Andrei Yahorau/IBA
To: pgsql-admin@postgresql.org,
Cc: Mikalai Keida/IBA@IBA
Date: 13/08/2018 13:16
Subject: Re: Logical replication monitoring

Hello!

Thank you for your suggestion.
I afraid this approach is not suitable for me. As a rule my postgresql log on subscriber side contains a bunch of the following entries:

ERROR: terminating logical replication worker due to timeout
00000 LOG: worker process: logical replication worker for subscription 24578 (PID 6217) exited with exit code 1

How should I handle this situation?
As I understand this is quite normal situation. But why is severity for it an ERROR ?

I have another assumption. Could you correct me if I am wrong.
I found out in the source code that logical replication worker termination depends on wal_receiver_timeout paramer.
So I propose setting wal_receiver_timeout to 0.
In this case I think that monitoring of the following views pg_stat_subscription, pg_publication and pg_stat_replication is enough.
In case if there is some problem with connection or with replication pg_stat_replication will show nothing because wal sender will not be working otherwise it will give some information.
Am I right? Are there any vulnerabilities in this approach ?

Best regards,
Andrei Yahorau

From: Andrei Yahorau/IBA
To: pgsql-admin@postgresql.org,
Cc: Mikalai Keida/IBA@IBA
Date: 10/08/2018 13:05
Subject: Logical replication monitoring

Hello PostgreSQL Community!

I configured logical replication for PostgreSQL 10.4 on 2 machines, set wal_level to logical, created a publication on master node and created a subscription on standby node according to the PostgreSQL documentation.
Could you please suggest an approach for replication state monitoring.

According to my experience the monitoring of pg_stat_subscription and pg_publication, pg_replication_slots unfortunately is not enough for this aim. Moreover standby database does not prohibit write operations by default and it can lead to some inconsistency between these databases.

For example a chain of queries as
SELECT pg_is_is_recovery(),
SELECT * FROM pg_stat_replication and
SELECT * FROM pg_stat_wal_receiver
provide insight into replication state for hot_standby replication.

So is there a reliable way of replication state monitoring for logical replication?

Best regards,
Andrei Yahorau

-- 
Achilleas Mantzios
IT DEV Lead
IT DEPT
Dynacom Tankers Mgmt

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: Logical replication monitoring