Re: BUG #18897: Logical replication conflict after using pg_createsubscriber under heavy load
From | Shlok Kyal |
---|---|
Subject | Re: BUG #18897: Logical replication conflict after using pg_createsubscriber under heavy load |
Date | |
Msg-id | CANhcyEUs+_fgmd61jWiSvwxYz+-DGgL00q=C5ZdoYaj9D9baWw@mail.gmail.com |
In response to | RE: BUG #18897: Logical replication conflict after using pg_createsubscriber under heavy load ("Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com>) |
List | pgsql-bugs |
On Tue, 22 Jul 2025 at 17:51, Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote:
>
> Dear Shlok,
>
> > I checked it and here is my analysis:
> >
> > When we create a slot, it returns the confirmed_flush LSN as a
> > consistent_lsn. I noticed that in general when we create a slot, the
> > confirmed_flush is set to the end of a RUNNING_XACT log or we can say
> > start of the next record. And this next record can be anything. It can
> > be a COMMIT record for a transaction in another session.
> > I have attached server logs and waldump logs for one such case
> > reproduced using the test script shared in [1].
> > The snapbuild machinery has four steps: START, BUILDING_SNAPSHOT,
> > FULL_SNAPSHOT and SNAPBUILD_CONSISTENT. Between each step a
> > RUNNING_XACT is logged.
> ...
>
> Thanks for the analysis! It is quite helpful. Based on your point, I understood
> it as below. Is this correct?
>
> Facts:
> =====
> 1.
> RUNNING_XACT records can be generated when the snapshot status is advanced while
> creating the slot.
> 2.
> pg_create_logical_replication_slot() returns the end point of the RUNNING_XACT
> record that was generated when the snapshot became SNAPBUILD_CONSISTENT.
> 3.
> Some transactions could be started while the snapshot is in the FULL_SNAPSHOT
> state, and they can be committed after we reach SNAPBUILD_CONSISTENT. Such
> transactions should be output by the upcoming logical decoding.
>
> What happened here:
> =================
> a.
> confirmed_flush_lsn was 0/03CBCCA0, which is the end of RUNNING_XACT (lsn: 0/03CBCC58).
> Also, a COMMIT record for txn 1369 was located *just after* the RUNNING_XACT [1].
> b.
> pg_createsubscriber set the recovery_target_lsn to "0/03CBCCA0", and
> recovery_target_inclusive was true. This meant the record starting at "0/03CBCCA0"
> must be applied.
> c.
> The startup process applied WAL up to that point. Transaction 1369 was applied
> and then the standby could be promoted.
> e.
> The logical walsender decoded transaction 1369 and replicated it to the standby.
> However, it had already been applied by the startup process, so a conflict could happen.
>
> [1]:
> according to the log:
> ```
> ...
> rmgr: Standby len (rec/tot): 70/ 70, tx: 0, lsn: 0/03CBCC58, prev 0/03CBCC18, desc: RUNNING_XACTS nextXid 1370 latestCompletedXid 1364 oldestRunningXid 1365; 5 xacts: 1366 1365 1369 1368 1367
> rmgr: Transaction len (rec/tot): 46/ 46, tx: 1369, lsn: 0/03CBCCA0, prev 0/03CBCC58, desc: COMMIT 2025-07-20 16:50:18.031146 IST
> ...
> ```
>
> Best regards,
> Hayato Kuroda
> FUJITSU LIMITED
>

Hi Kuroda-san,

Thanks for reviewing the thread. Your understanding is correct.

Thanks,
Shlok Kyal
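
For readers following steps a. to e. above, the overlap can be illustrated with a small toy model. This is only a sketch under stated assumptions, not PostgreSQL code: the names `WalRecord`, `replayed_by_recovery`, `decoded_by_slot` and `double_applied` are hypothetical, recovery is assumed to compare the target against each record's start LSN, and the slot is assumed to decode every commit whose record starts at or after confirmed_flush. The two records are the ones shown in the waldump excerpt [1].

```python
from dataclasses import dataclass


def parse_lsn(text: str) -> int:
    """Convert an LSN such as '0/03CBCCA0' into a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


@dataclass(frozen=True)
class WalRecord:
    start: int      # start LSN of the record
    desc: str       # record type, e.g. "COMMIT"
    xid: int = 0    # transaction id (0 for non-transactional records)


def replayed_by_recovery(records, target, inclusive):
    """Toy model (assumption): replay records in order and stop at the first
    record whose start LSN reaches the target. That record is applied only
    when the target is inclusive."""
    applied = []
    for rec in records:
        if rec.start >= target:
            if inclusive:
                applied.append(rec)
            break
        applied.append(rec)
    return applied


def decoded_by_slot(records, confirmed_flush):
    """Toy model (assumption): the slot sends every transaction whose COMMIT
    record starts at or after confirmed_flush."""
    return [r for r in records
            if r.desc == "COMMIT" and r.start >= confirmed_flush]


def double_applied(records, lsn_text, inclusive):
    """Xids that are both replayed by recovery and later sent by the slot."""
    lsn = parse_lsn(lsn_text)
    replayed = {r.xid for r in replayed_by_recovery(records, lsn, inclusive)
                if r.desc == "COMMIT"}
    decoded = {r.xid for r in decoded_by_slot(records, lsn)}
    return sorted(replayed & decoded)


# The two records from the waldump excerpt in [1].
wal = [
    WalRecord(parse_lsn("0/03CBCC58"), "RUNNING_XACTS"),
    WalRecord(parse_lsn("0/03CBCCA0"), "COMMIT", xid=1369),
]

print(double_applied(wal, "0/03CBCCA0", inclusive=True))   # [1369]
print(double_applied(wal, "0/03CBCCA0", inclusive=False))  # []
```

In this model, using the same LSN both as an inclusive recovery target and as the decoding start point makes transaction 1369 appear on both sides, which matches the conflict described in steps c. and e. The non-inclusive case is shown only for contrast; the sketch makes no claim about which fix was chosen.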