Stalled replica (terminating connection due to conflict with recovery)

Поиск

Список

Период

Сортировка

От	Kouber Saparev
Тема	Stalled replica (terminating connection due to conflict with recovery)
Дата	2 февраля 2023 г. 01:37:05
Msg-id	CAN4RuQs6OfWZonT9DjiLGtwA7p+rifOn4GKCgUoNx1zMYmTjPA@mail.gmail.com обсуждение исходный текст
Список	pgsql-admin

Дерево обсуждения

We are using a cascading streaming replication setup with PostgreSQL 14.6:

db1 -> db2 -> db.warehouse

hot_standby_feedback = on

max_standby_streaming_delay = 15min

On the master node (db1) some piece of code is creating and dropping temporary tables all the time, thus bloating pg_attribute, pg_depend, etc. The rate is around 1500 inserts + 1500 deletes per second on pg_attribute. As a result autovacuum is launched every minute on the latter.

The situation we are facing is that in some occasions the replica (db.warehouse) becomes unresponsive for a specific database - nobody could connect to the database (through psql or other client) for several minutes until eventually we restart the database. Trying to connect with psql (or pg_activity) hangs forever.

Connecting to other databases, such as postgres, succeeds. Then eventually we could see the list of queries within pg_stat_activity and kill them all with pg_terminate_backend(). Regardless of that, the database in question is still refusing new connections, throwing some "terminating connection due to conflict with recovery" errors eventually in the log.

There are 3 rows within pg_stat_activity (except the one issuing the query to get them):

RECORD 1:

backend_start | 2023-02-01 13:21:44.167132+00
wait_event_type | IPC
wait_event | RecoveryConflictSnapshot

RECORD 2:

backend_start | 2023-02-01 13:21:44.202399+00
wait_event_type | Activity
wait_event | BgWriterMain

RECORD 3:

backend_start | 2023-02-01 13:21:44.202096+00
wait_event_type | Activity
wait_event | CheckpointerMain

No long running queries, nothing... just these 3 lines and a database that is not willing to accept connections.

Any idea what might be causing this behaviour? We highly suspect the crazy writing over pg_attribute on the master node, but still - a cancelled query because of a conflict is one thing, a stalled database not accepting connections for anybody - a little disaster.

Any advice on how to avoid this would be highly appreciated.

Kouber Saparev

В списке pgsql-admin по дате отправления:

Предыдущее

От: Tom Lane
Дата: 02 февраля 2023 г., 01:18:17
Сообщение: Re: Mutex error 22 - Postgres version 14

Следующее

От: sireesha
Дата: 02 февраля 2023 г., 01:45:31
Сообщение: Re: Mutex error 22 - Postgres version 14

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Stalled replica (terminating connection due to conflict with recovery)

Предыдущее

Следующее