RE: Fix slot synchronization with two_phase decoding enabled
От | Zhijie Hou (Fujitsu) |
---|---|
Тема | RE: Fix slot synchronization with two_phase decoding enabled |
Дата | |
Msg-id | OS0PR01MB5716E010AA629083EF5ADA3694802@OS0PR01MB5716.jpnprd01.prod.outlook.com обсуждение исходный текст |
Ответ на | RE: Fix slot synchronization with two_phase decoding enabled ("Zhijie Hou (Fujitsu)" <houzj.fnst@fujitsu.com>) |
Список | pgsql-hackers |
On Mon, Apr 28, 2025 at 7:33 PM Zhijie Hou (Fujitsu) wrote: > > Thanks for reviewing. Here is V3 patch that addressed it. > > BTW, I also did some tests to confirm the catalog_xmin could still be > ahead in some case, and here is an example: > > 1. Create a failover replication slot named 'logicalslot' on primary > and acquire it in the walsender. > > 2. Log two standby snapshots on primary. Before logging, call > txid_current() To assign a xid, so that each standby snapshot will > hold a new xid in its oldestrunningXid field: > - txid_current(); > - `0/3000420` - `running_xacts` (no running transactions, > oldestrunningXid = 755) > - txid_current(); > - `0/3000488` - `running_xacts` (no running transactions, > oldestrunningXid = 756) > > 3. The walsender sets `0/3000420` as the `candidate_restart_lsn`, 755 as > `candidate_catalog_xmin`, skipping the second `running_xacts` because > `candidate_restart_lsn`/`candidate_catalog_xmin` is already valid. > > 4. After receiving a reply from the apply worker, the walsender assigns > `0/3000420` to `restart_lsn`, `755` to `catalog_xmin`. At this point, the > replication slot 'logicalslot' has `restart_lsn` set to `0/3000420`, > `catalog_xmin` set to `755`. > > 5. On the standby, execute `pg_sync_replication_slots()` to synchronize > 'logicalslot'. > > 6. During synchronization, with the initial `restart_lsn` at `0/3000420`, the > sync slot reaches a consistent point at this position. As a result, it does > not update `candidate_restart_lsn` and `candidate_catalog_xmin` at > consistent point (refer to `SnapBuildProcessRunningXacts()`). > > 7. The sync process identifies the second standby snapshot at > `0/3000488` and > uses its LSN as `candidate_restart_lsn`, and use the > oldestrunningXid `756` > as `candidate_catalog_xmin`, eventually updating it to `restart_lsn` and > `catalog_xmin`. > > 8. Now, the synced slot holds `restart_lsn` at `0/3000488`, `catalog_xmin` at > `756`, which are all ahead of the remote slot on the primary server. > > Attaching a script to reproduce the same. > > Note that, to reproduce this stably, we'd better modify the value of > LOG_SNAPSHOT_INTERVAL_MS in bgwriter.c to a bigger number to avoid > unexpected xl_running_xacts logging. In addition to above steps, for those interested in reproducing the specific scenario where two_phase_at advances past the synced confirmed_flush, I'm attaching a new script. This script can reproduce the issue after applying the injection points I provided. Best Regards, Hou zj
Вложения
В списке pgsql-hackers по дате отправления: