Re: Fix race condition in InvalidatePossiblyObsoleteSlot()
От | Bertrand Drouvot |
---|---|
Тема | Re: Fix race condition in InvalidatePossiblyObsoleteSlot() |
Дата | |
Msg-id | ZakzrE6Oib9N5Nar@ip-10-97-1-34.eu-west-3.compute.internal обсуждение исходный текст |
Ответ на | Re: Fix race condition in InvalidatePossiblyObsoleteSlot() (Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>) |
Ответы |
Re: Fix race condition in InvalidatePossiblyObsoleteSlot()
|
Список | pgsql-hackers |
Hi, On Thu, Jan 18, 2024 at 04:59:39PM +0530, Bharath Rupireddy wrote: > IIUC, the issue is that we terminate the process holding the > replication slot, and the conflict cause that we recorded before > terminating the process changes in the next iteration due to the > advancement in effective_xmin and/or effective_catalog_xmin. Thanks for looking at it! Yeah, and that could lead to no conflict detected anymore (like in the case [2] reported up-thread). > FWIW, a test code something like [1], can help detect above race issues, right? I think so and I've added it in v2 attached (except that it uses the new "terminated" variable, see below), thanks! > Some comments on the patch: > > 1. > last_signaled_pid = active_pid; > + terminated = true; > } > > Why is a separate variable needed? Can't last_signaled_pid != 0 enough > to determine that a process was terminated earlier? Yeah probably, I thought about it but preferred to add a new variable for this purpose for clarity and avoid race conditions (in case futur patches "touch" the last_signaled_pid anyhow). I was thinking that this part of the code is already not that easy. > 2. If my understanding of the racy behavior is right, can the same > issue happen due to the advancement in restart_lsn? I'm not sure as I never saw it but it should not hurt to also consider this "potential" case so it's done in v2 attached. > I'm not sure if it > can happen at all, but I think we can rely on previous conflict reason > instead of previous effective_xmin/effective_catalog_xmin/restart_lsn. I'm not sure what you mean here. I think we should still keep the "initial" LSN so that the next check done with it still makes sense. The previous conflict reason as you're proposing also makes sense to me but for another reason: PANIC in case the issue still happen (for cases we did not think about, means not covered by what the added previous LSNs are covering). > 3. Is there a way to reproduce this racy issue, may be by adding some > sleeps or something? If yes, can you share it here, just for the > records and verify the whatever fix provided actually works? Alexander was able to reproduce it on a slow machine and the issue was not there anymore with v1 in place. I think it's tricky to reproduce as it would need the slot to advance between the 2 checks. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Вложения
В списке pgsql-hackers по дате отправления: