Re: subscriptionCheck failures on nightjar
From | Tomas Vondra |
---|---|
Subject | Re: subscriptionCheck failures on nightjar |
Date | |
Msg-id | 20190826132904.3ayuw36qzl2c4ktr@development |
In reply to | Re: subscriptionCheck failures on nightjar (Michael Paquier <michael@paquier.xyz>) |
Responses | Re: subscriptionCheck failures on nightjar (Tom Lane <tgl@sss.pgh.pa.us>) Re: subscriptionCheck failures on nightjar (Robert Haas <robertmhaas@gmail.com>) |
List | pgsql-hackers |
On Tue, Aug 13, 2019 at 05:04:35PM +0900, Michael Paquier wrote:
>On Wed, Feb 13, 2019 at 01:51:47PM -0800, Andres Freund wrote:
>> I'm not yet sure that that's actually something that's supposed to
>> happen, I got to spend some time analysing how this actually
>> happens. Normally the contents of the slot should actually prevent it
>> from being removed (as they're newer than
>> ReplicationSlotsComputeLogicalRestartLSN()). I kind of wonder if that's
>> a bug in the drop logic in newer releases.
>
>In the same context, could it be a consequence of 9915de6c which has
>introduced a conditional variable to control slot operations? This
>could have exposed more easily a pre-existing race condition.
>--

This is one of the remaining open items, and we don't seem to be moving
forward with it :-(

I'm willing to take a stab at it, but to do that I need a way to
reproduce it. Tom, you mentioned you've managed to reproduce it in a
qemu instance, but that it took some fiddling with qemu parameters or
something. Can you share what exactly was necessary?

An observation about the issue - while we started to notice this after
December, that's mostly because the PANIC patch went in shortly before.
We've however seen the issue before, as Thomas Munro mentioned in [1].

Those reports are from August, so it's quite possible something in the
first CF upset the code. And there's only a single commit in 2018-07
that seems related to logical decoding / snapshots [2], i.e. f49a80c:

    commit f49a80c481f74fa81407dce8e51dea6956cb64f8
    Author: Alvaro Herrera <alvherre@alvh.no-ip.org>
    Date:   Tue Jun 26 16:38:34 2018 -0400

        Fix "base" snapshot handling in logical decoding

        ...

The other reason to suspect this is related is that the fix also made it
to REL_11_STABLE at that time, and if you check the buildfarm data [3],
you'll see that 11 fails on nightjar too, from time to time.

This means it's not a 12+ only issue, it's a live issue on 11. I don't
know if f49a80c is the culprit, or if it simply uncovered a pre-existing
bug (e.g. due to timing).

[1] https://www.postgresql.org/message-id/CAEepm%3D0wB7vgztC5sg2nmJ-H3bnrBT5GQfhUzP%2BFfq-WT3g8VA%40mail.gmail.com

[2] https://commitfest.postgresql.org/18/1650/

[3] https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=nightjar&br=REL_11_STABLE

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services