RE: Conflict detection for update_deleted in logical replication
От | Zhijie Hou (Fujitsu) |
---|---|
Тема | RE: Conflict detection for update_deleted in logical replication |
Дата | |
Msg-id | OS0PR01MB5716163486D6131F5B77A47B946CA@OS0PR01MB5716.jpnprd01.prod.outlook.com обсуждение исходный текст |
Ответ на | Re: Conflict detection for update_deleted in logical replication (Amit Kapila <amit.kapila16@gmail.com>) |
Ответы |
RE: Conflict detection for update_deleted in logical replication
Re: Conflict detection for update_deleted in logical replication Re: Conflict detection for update_deleted in logical replication |
Список | pgsql-hackers |
On Mon, Jun 2, 2025 at 2:39 PM Amit Kapila wrote: > > On Mon, May 26, 2025 at 12:46 PM Zhijie Hou (Fujitsu) > <houzj.fnst@fujitsu.com> wrote: > > > > Attaching the V32 patch set which addressed comments in [1]~[5]. > > > > Review comments: > =============== > * > +advance_conflict_slot_xmin(FullTransactionId new_xmin) { > +FullTransactionId full_xmin; FullTransactionId next_full_xid; > + > + Assert(MyReplicationSlot); > + Assert(FullTransactionIdIsValid(new_xmin)); > + > + next_full_xid = ReadNextFullTransactionId(); > + > + /* > + * Compute FullTransactionId for the current xmin. This handles the > + case > + * where transaction ID wraparound has occurred. > + */ > + full_xmin = FullTransactionIdFromAllowableAt(next_full_xid, > + MyReplicationSlot->data.xmin); > + > + if (FullTransactionIdPrecedesOrEquals(new_xmin, full_xmin)) return; > > The above code suggests that the launcher could compute a new xmin that is > less than slot's xmin. At first, this looks odd to me, but IIUC, this can happen > when the user toggles retain_conflict_info flag at some random time when the > launcher is trying to compute the new xmin value for the slot. One of the > possible combinations of steps for this race could be as follows: > > 1. The subscriber has two subscriptions, A and B. Subscription A has > retain_conflict_info as true, and B has retain_conflict_info as false 2. Say the > launcher calls get_subscription_list(), and worker A is already alive. > 3. Assuming the apply worker will restart on changing retain_conflict_info, the > user enables retain_conflict_info for subscription B. > 4. The launcher processes the subscription B first in the first cycle, and starts > worker B. Say, worker B gets 759 as candidate_xid. > 5. The launcher creates the conflict detection slot, xmin = 759 6. Say a new txn > happens, worker A gets 760 as candidate_xid and updates it to > oldest_nonremovable_xid. > 7. The launcher processes the subscription A in the first cycle, and the final > xmin value is 760, because it only checks the oldest_nonremovable_xid from > worker A. The launcher then updates the value to slot.xmin. > 8. In the next cycle, the launcher finds that worker B has an older > oldest_nonremovable_xid 759, so the minimal xid would now be 759. The > launher would have retreated the slot's xmin unless we had the above check in > the quoted code. > > I think the above race is possible because the system lets the changed > subscription values of a subscription take effect asynchronously by workers. > The one more similar race condition handled by the patch is as follows: > > * > ... > + * It's necessary to use FullTransactionId here to mitigate potential > + race > + * conditions. Such scenarios might occur if the replication slot is > + not > + * yet created by the launcher while the apply worker has already > + * initialized this field. During this period, a transaction ID > + wraparound > + * could falsely make this ID appear as if it originates from the > + future > + * w.r.t the transaction ID stored in the slot maintained by launcher. > + See > + * advance_conflict_slot_xmin. > ... > + FullTransactionId oldest_nonremovable_xid; > > This case can happen if the user disables and immediately enables the > retain_conflict_info option. In this case, the launcher may drop the slot after > noticing the disable. But the apply worker may not notice the disable and it only > notices that the retain_conflict_info is still enabled, so it will keep maintaining > oldest_nonremovable_xid when the slot is not created. > > It is okay to handle both the race conditions, but I am worried we may miss > some such race conditions which could lead to difficult-to-find bugs. So, at > least for the first version of this option (aka for patches 0001 to 0003), we can > add a condition that allows us to change retain_conflict_info only on disabled > subscriptions. This will simplify the patch. Agreed. > We can make a separate patch to > allow changing retain_conflict_info option for enabled subscriptions. That will > make it easier to evaluate such race conditions and the solutions more deeply. I will prepare a separate patch soon. Here is the V33 patch set which includes the following changes: 0001: * Renaming and typo fixes based on Shveta's comments [1] * Comment changes suggested by Amit [2] * Changed oldest_nonremoable_xid from FullTransactionID to TransactionID. * Code refactoring in drop_conflict_slot_if_exists() 0002: * Documentation updates suggested by Amit [2] * Code modifications to adapt to TransactionID oldest_nonremoable_xid 0003: * Documentation improvements suggested by Shveta [3] * Added restriction: disallow changing retain_conflict_info when sub is enabled or worker is alive 0004: * Simplified race condition handling due to the new restriction from 0003 0005: * Code updates to accommodate both the TransactionID type for oldest_nonremoable_xid and the new restriction from 0003 0006: * New test case for the restriction introduced in 0003 0007: No changes [1] https://www.postgresql.org/message-id/CAJpy0uBSsRuVOeuo-i8R_aO0CMiORHTnEBZ9z-TDq941WqhyLA%40mail.gmail.com [2] https://www.postgresql.org/message-id/CAA4eK1KUTHbgroBRNp8_dy3Lrc%2BetPm19O1RcyRcDBgCp7EFcg%40mail.gmail.com [3] https://www.postgresql.org/message-id/CAJpy0uAJUTmSx7fAE3gbnBUzp9ZDOgkLrP5gdoysKUGbvf7vGg%40mail.gmail.com Best Regards, Hou zj
Вложения
- v33-0007-Support-the-conflict-detection-for-update_delete.patch
- v33-0001-Maintain-the-oldest-non-removable-transaction-ID.patch
- v33-0002-Maintain-the-replication-slot-in-logical-launche.patch
- v33-0003-Add-a-retain_conflict_info-option-to-subscriptio.patch
- v33-0004-Introduce-a-new-GUC-max_conflict_retention_durat.patch
- v33-0005-Re-create-the-replication-slot-if-the-conflict-r.patch
- v33-0006-Add-a-tap-test-to-verify-the-management-of-the-n.patch
В списке pgsql-hackers по дате отправления: