RE: Replication slot is not able to sync up
From | Zhijie Hou (Fujitsu) |
---|---|
Subject | RE: Replication slot is not able to sync up |
Date | |
Msg-id | OS0PR01MB5716C920F4F6519D74564EB89461A@OS0PR01MB5716.jpnprd01.prod.outlook.com |
In reply to | Re: Replication slot is not able to sync up (Masahiko Sawada <sawada.mshk@gmail.com>) |
Responses | Re: Replication slot is not able to sync up |
List | pgsql-hackers |
On Wed, May 28, 2025 at 2:09 AM Masahiko Sawada wrote:
>
> On Fri, May 23, 2025 at 10:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > In the case presented here, the logical slot is expected to keep
> > forwarding, and in the consecutive sync cycle, the sync should be
> > successful. Users using logical decoding APIs should also be aware
> > that if due for some reason, the logical slot is not moving forward,
> > the master/publisher node will start accumulating dead rows and WAL,
> > which can create bigger problems.
>
> I've tried this case and am concerned that the slot synchronization using
> pg_sync_replication_slots() would never succeed while the primary keeps
> getting write transactions. Even if the user manually consumes changes on the
> primary, the primary server keeps advancing its XID in the meanwhile. On the
> standby, we ensure that the TransamVariables->nextXid is beyond the XID of
> WAL record that it's going to apply so the xmin horizon calculated by
> GetOldestSafeDecodingTransactionId() ends up always being higher than the
> slot's catalog_xmin on the primary. We get the log message "could not
> synchronize replication slot "s" because remote slot precedes local slot" and
> cleanup the slot on the standby at the end of pg_sync_replication_slots().

To handle this workload better, we could modify pg_sync_replication_slots()
to wait for the primary slot to advance to a suitable position before
completing synchronization and removing the temporary slot. The sync would
then complete as soon as the primary slot advances, whether through
pg_logical_xx_get_changes() or by other means.

I've created a POC (attached) that currently waits indefinitely for the
remote slot to catch up. If this approach seems acceptable, we could later
add a timeout parameter to bound the maximum wait time.

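To make the proposed behavior concrete, here is a minimal simulation of the wait loop in Python. It is only a sketch and is not the attached POC, which is C code inside the server. All names here (SlotPosition, remote_precedes_local, wait_for_remote_slot, fetch_remote) are hypothetical, and the comparison rule (the remote slot "precedes" the local one if either its restart_lsn or its catalog_xmin is behind) is an assumption inferred from the log message quoted above. XID wraparound is ignored for simplicity.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class SlotPosition:
    """Simplified view of a slot's position (hypothetical, for illustration)."""
    restart_lsn: int   # WAL position, modelled as a plain integer
    catalog_xmin: int  # oldest XID whose catalog rows the slot still needs


def remote_precedes_local(remote: SlotPosition, local: SlotPosition) -> bool:
    # Assumed rule: the remote slot is "behind" if either value is behind.
    # Plain integer comparison; XID wraparound is deliberately ignored.
    return (remote.restart_lsn < local.restart_lsn
            or remote.catalog_xmin < local.catalog_xmin)


def wait_for_remote_slot(
    fetch_remote: Callable[[], SlotPosition],
    local: SlotPosition,
    poll_interval: float = 1.0,
    timeout: Optional[float] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the remote slot has caught up with the local one.

    Returns True once the remote slot no longer precedes the local slot.
    With timeout=None the loop waits indefinitely, as the POC does today;
    otherwise it returns False after roughly `timeout` seconds of waiting.
    """
    waited = 0.0
    while True:
        remote = fetch_remote()
        if not remote_precedes_local(remote, local):
            return True
        if timeout is not None and waited >= timeout:
            return False
        sleep(poll_interval)
        waited += poll_interval
```

In this model, fetch_remote stands in for whatever query the standby runs against the primary to read the slot's current position. The optional timeout argument corresponds to the possible follow-up mentioned above.
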
I verified that, with pgbench TPC-B running on the primary, calling
pg_sync_replication_slots() on the standby blocks as intended until I
advance the primary slot position by calling pg_logical_xx_get_changes().

If the basic idea sounds reasonable, I can start a separate thread to
extend this API. Thoughts?

Best Regards,
Hou zj