Re: BUG #18897: Logical replication conflict after using pg_createsubscriber under heavy load
From | Euler Taveira |
---|---|
Subject | Re: BUG #18897: Logical replication conflict after using pg_createsubscriber under heavy load |
Date | |
Msg-id | 86e60caf-d994-4b97-b504-22db9c825c52@app.fastmail.com |
In reply to | BUG #18897: Logical replication conflict after using pg_createsubscriber under heavy load (PG Bug reporting form <noreply@postgresql.org>) |
Replies |
Re: BUG #18897: Logical replication conflict after using pg_createsubscriber under heavy load
Re: BUG #18897: Logical replication conflict after using pg_createsubscriber under heavy load |
List | pgsql-bugs |
On Wed, Apr 16, 2025, at 8:14 PM, PG Bug reporting form wrote:
I'm in the process of converting our databases from pglogical logical replication to the native logical replication implementation on PostgreSQL 17. One of the bugs we encountered and had to work around with pglogical was the plugin dropping records while converting a streaming replica to logical via pglogical_create_subscriber (reported https://github.com/2ndQuadrant/pglogical/issues/349). I was trying to confirm that the native logical replication implementation did not have this problem, and I've found that it might have a different problem.
pg_createsubscriber uses a different approach than pglogical. While pglogical
uses a restore point, pg_createsubscriber uses the LSN from the latest
replication slot as the replication start point. The restore point approach is
usually suitable for physical replication but might not cover all scenarios for
logical replication (such as when there are in-progress transactions). Since
creating a logical replication slot finds a consistent decoding start point,
it is a natural choice for starting logical replication (which also needs a
decoding start point).
I should say that I've been operating under the assumption that pg_createsubscriber is designed for use on a replica for a *live* primary database, if this isn't correct then someone please let me know.
pg_createsubscriber expects a physical replica that is preferably stopped
before running it.
I have a script that I've been using to reproduce the issue (pasted at end of email, because this bug reporting page doesn't seem to support attachments). It basically performs a loop that sets up a primary and a physical replica, generates some load, converts the replica to logical, waits, and makes sure the row counts are the same.
If I run your tests, it reports
$ NUM_THREADS=40 INSERT_WIDTH=1000 /tmp/logical_stress_test.sh
.
.
*** Successfully started logical replica on port 5341.
*** ALL INSERT LOOPS FINISHED
SOURCE COUNT = 916000
DEST COUNT = 768000
ERROR: record count mismatch
but after some time
$ psql -X -p 5340 -c "SELECT count(f1) FROM test_table" -d test_db
count
--------
916000
(1 row)
$ psql -X -p 5341 -c "SELECT count(f1) FROM test_table" -d test_db
count
--------
916000
(1 row)
I also checked the data
$ pg_dump -t test_table -p 5340 -d test_db -f - | sort > /tmp/p.out
$ pg_dump -t test_table -p 5341 -d test_db -f - | sort > /tmp/r.out
$ diff -q /tmp/p.out /tmp/r.out
$ echo $?
0
Your script does not wait long enough for the subscriber to apply the backlog.
Unless you are seeing a different symptom, there is no bug.
You should have used something similar to wait_for_subscription_sync routine
(Cluster.pm) before counting the rows. That's what is used in the
pg_createsubscriber tests. It guarantees the subscriber has caught up.
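For a shell script, a rough equivalent of that wait could look like the sketch below. It is not the Cluster.pm routine itself: it captures the publisher's current WAL position and polls pg_stat_replication on the publisher until the walsender serving the subscription reports a replay position at or past it. The function name, its arguments and the timeout are hypothetical, and it assumes the subscription's application_name is the subscription name (the default), which can be looked up in pg_subscription on the subscriber.

```shell
# Sketch only: wait until the subscriber has replayed everything the
# publisher had written when this function was called.
# Arguments (hypothetical): publisher port, database, subscription name,
# timeout in seconds (defaults to 180).
wait_for_catchup() {
    pub_port=$1
    db=$2
    sub_name=$3
    timeout=${4:-180}

    # The publisher's current WAL position is the target to reach.
    target_lsn=$(psql -X -At -p "$pub_port" -d "$db" \
        -c "SELECT pg_current_wal_lsn()") || return 1

    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        # Yields 't' once the walsender for this subscription is streaming
        # and reports replay_lsn at or beyond the target.
        caught_up=$(psql -X -At -p "$pub_port" -d "$db" -c "
            SELECT '$target_lsn'::pg_lsn <= replay_lsn AND state = 'streaming'
            FROM pg_stat_replication
            WHERE application_name = '$sub_name'")
        if [ "$caught_up" = "t" ]; then
            return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done

    echo "timed out waiting for $sub_name to reach $target_lsn" >&2
    return 1
}
```

Called after all insert loops have finished, for example as `wait_for_catchup 5340 test_db "$SUB_NAME" && psql -X -p 5341 -d test_db -c "SELECT count(f1) FROM test_table"`, the count on the subscriber is taken only once the backlog has been applied. Here `$SUB_NAME` stands for whatever subscription name pg_createsubscriber created.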