RE: Conflict detection for update_deleted in logical replication
От | Zhijie Hou (Fujitsu) |
---|---|
Тема | RE: Conflict detection for update_deleted in logical replication |
Дата | |
Msg-id | TY4PR01MB169075232AE3A6E643F0EB749940EA@TY4PR01MB16907.jpnprd01.prod.outlook.com обсуждение исходный текст |
Ответ на | Re: Conflict detection for update_deleted in logical replication (Amit Kapila <amit.kapila16@gmail.com>) |
Ответы |
Re: Conflict detection for update_deleted in logical replication
|
Список | pgsql-hackers |
On Tuesday, September 9, 2025 7:01 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > On Tue, Sep 9, 2025 at 11:47 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> > wrote: > > > > Here is V71 patch set which addressed above comments and [1]. > > > > IIUC, this patch after stopping the retention, it immediately starts retrying to > resume by transitioning through various phases. This can consume CPU and > network resources, if the apply_worker takes a long time to catch up. I agree. I think one way is to increase the interval in each cycle when the retention has been stopped and the worker is retrying to resume the retention. I have updated the patch for the same. > > Few other comments: > 1. > + /* > + * Return if the launcher has not initialized oldest_nonremovable_xid. > + * > + * It might seem feasible to directly check the conflict detection > + * slot.xmin instead of relying on the launcher to assign the worker's > + * oldest_nonremovable_xid; however, that could lead to a race > + condition > + * where slot.xmin is set to InvalidTransactionId immediately after the > + * check. In such cases, oldest_nonremovable_xid would no longer be > + * protected by a replication slot and could become unreliable if a > + * wraparound occurs. > + */ > + if (!TransactionIdIsValid(nonremovable_xid)) > + return; > > I understand the reason why we get assigned the worker's non_removable_xid > from launcher but all this makes the code complex to understand. Isn't there > any other way to achieve it? One naïve and inefficient way could be to just > restart the worker probably after updating its retentionactive flag. I am not > suggesting to make this change but just a brainstorming point. I'll keep thinking about it, and for now, I've added a comment mentioning that rebooting is a simpler solution. > > 2. The function should_stop_conflict_info_retention() is invoked from > wait_for_local_flush() and then it can lead further state transitioning which > doesn't appear neat and makes code difficult to understand. I changed the logic to avoid proceeding to next phase when the retention is stopped. > > 3. > + /* > + * If conflict info retention was previously stopped due to a timeout, > + and > + * the time required to advance the non-removable transaction ID has > + now > + * decreased to within acceptable limits, resume the rentention. > + */ > + if (!MySubscription->retentionactive) > + { > + rdt_data->phase = RDT_RESUME_CONFLICT_INFO_RETENTION; > + process_rdt_phase_transition(rdt_data, false); return; } > > In this check, where do we check the time has come in acceptable range? Can > you update comments to make it clear? I updated the comments to mention that the code can reach here only when the time is within max_retention_duration. Here is the V72 patch set which addressed above and Shveta's comments[1]. [1] https://www.postgresql.org/message-id/CAJpy0uDw7SmCN_jOvpNUzo_sE4ZsgpqQ5_vHLjm4aCm10eBApA%40mail.gmail.com Best Regards, Hou zj
Вложения
В списке pgsql-hackers по дате отправления: