Read Replica Inconsistencies

From: Giuliano Sofi
Subject: Read Replica Inconsistencies
Date:
Msg-id CAHjPeqvLb9ewP0ubvZdCaafZWriJRuE3Qr7hUkNn5qZQfBxD_A@mail.gmail.com
List: pgsql-bugs

Hello folks,

Over the last few weeks we have started to see some strange behavior on our AWS RDS read replica. Below are the details of our setup, the problem we are experiencing, and the investigation we have done so far.

Our setup

Service: Amazon RDS Postgres Replica

Postgres Version: 13.7

Replication Model: Single-Master

Class: Unidirectional

Mode: Asynchronous Streaming

Type: Physical Replication

Resources: Both the primary and the read replica are managed by Amazon RDS. The primary instance is a db.m6g.2xlarge, while the replica is a db.m6g.large. Both have 640 GB gp2 disks and the Multi-AZ option enabled.

The problem

We are experiencing primary key constraint violations related to the replication process. In detail:

A couple of times per day, tables that have a primary key defined on the master contain duplicated records on the read replica (i.e. the primary key constraint is violated), and we get no error or warning from the replica instance. The duplicates are detected by our tests and monitoring tools; they usually disappear from the replica within a couple of minutes, but in rare cases it takes up to 4 hours for the replica to return to a consistent state.
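For reference, this is roughly the kind of check our monitoring runs against the replica (the table and column names here are hypothetical, our real tables differ):

    -- Hypothetical table "orders" with primary key "id": on the replica,
    -- group by the primary-key column and report any value stored more than once.
    SELECT id, count(*) AS copies
    FROM orders
    GROUP BY id
    HAVING count(*) > 1;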

Configuration

We compared the configuration of the two instances without finding any significant differences; in any case, we are attaching both configurations to this message so you can check them.

We would like to understand if there is something we can tune to prevent such duplications in our production replica.


Here you can find the configuration files: attachments
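In case it helps with the comparison, the attached settings can be reproduced on each instance with a query along these lines:

    -- Dump every setting that does not come from the built-in defaults,
    -- so the primary and replica outputs can be diffed side by side.
    SELECT name, setting, source
    FROM pg_settings
    WHERE source <> 'default'
    ORDER BY name;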

Investigation

Transaction Log Disk Usage

We use the replica instance as the source for our internal data pipelines, which run on an hourly schedule. From the monitoring dashboards we see that each time the ETLs start reading there is an increase in the replica lag metric, which is expected. What is less expected is an occasional spike in the Transaction Log Disk Usage metric, with peaks of up to 3 GB, which seems unrelated to the lag metric.


Do you think these spikes could be related to our issue?
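The lag figures above come from the RDS dashboards, but roughly the same information can be sampled directly on the standby with the standard views and functions, which is how we cross-checked the metric:

    -- On the standby: last WAL position received vs. replayed, and the
    -- time elapsed since the last replayed transaction.
    SELECT pg_last_wal_receive_lsn()                AS received_lsn,
           pg_last_wal_replay_lsn()                 AS replayed_lsn,
           now() - pg_last_xact_replay_timestamp()  AS replay_delay;

    -- Status of the streaming connection as seen from the standby.
    SELECT status, flushed_lsn, latest_end_lsn, latest_end_time
    FROM pg_stat_wal_receiver;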

TransactionIdIsInProgress bugfix in version 13.8

We've also noticed that the 13.8 release addresses a bug related to replication. The release notes at https://www.postgresql.org/docs/release/13.8/ state:

  • Fix race condition when checking transaction visibility (Simon Riggs)
    TransactionIdIsInProgress could report false before the subject transaction is considered visible, leading to various misbehaviors. The race condition window is normally very narrow, but use of synchronous replication makes it much wider, because the wait for a synchronous replica happens in that window.

Do you think we could be affected by this issue? Since Amazon RDS does not support that version yet, we cannot upgrade to check whether it solves the problem. Has this fix also been back-ported to the 14.x branch? If so, and you agree this could be our issue, we could evaluate an upgrade to address it as soon as possible.

Master serial primary keys are varchar in the replica

On the replica instance, id columns of type serial appear as varchar. Given a table on the master whose primary key is a SERIAL column, the DDL of the replicated table is different: the primary key column is a VARCHAR instead of a SERIAL/INT. We do not apply any strategy or configuration for handling ids, so this conversion happens automatically during the replica build/streaming.

The following screenshots show the difference in primary key type between the master table and its replicated version:


If the serial4 type cannot be preserved during replication, we would at least expect it to fall back to a numeric type and remain subject to the defined constraints.
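For what it is worth, the type difference can also be seen with a plain catalog query run on both sides (the table name below is just an example):

    -- Run on the primary and on the replica; the reported data_type of the
    -- primary-key column is what differs between the two instances.
    SELECT table_name, column_name, data_type, column_default
    FROM information_schema.columns
    WHERE table_name = 'orders'      -- hypothetical table name
      AND column_name = 'id';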

We also considered that the unique constraint violations might be caused by collation problems, so we ran the bt_index_check() function from the amcheck extension on the primary key indexes of the replica, without finding any issues.
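Concretely, the check we ran was along the following lines (the index name is just an example):

    -- amcheck is created on the primary (the standby is read-only) and
    -- becomes available on the replica through replication.
    CREATE EXTENSION IF NOT EXISTS amcheck;

    -- On the replica: verify the B-tree invariants of a primary-key index;
    -- the call raises an error if any inconsistency is found.
    SELECT bt_index_check('orders_pkey');   -- hypothetical index name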

Could you please help us figure out how to solve this issue?

Thanks a lot for your help

The CloudAcademy team
