Re: hung backends stuck in spinlock heavy endless loop
From: Merlin Moncure
Subject: Re: hung backends stuck in spinlock heavy endless loop
Date:
Msg-id: CAHyXU0x7MPmW1v1kqB5Trb_z0no5w5QpK7_qFo0CYvNngyYsbA@mail.gmail.com
In reply to: Re: hung backends stuck in spinlock heavy endless loop (Peter Geoghegan <pg@heroku.com>)
Responses:
  Re: hung backends stuck in spinlock heavy endless loop (Jeff Janes <jeff.janes@gmail.com>)
  Re: hung backends stuck in spinlock heavy endless loop (Peter Geoghegan <pg@heroku.com>)
  Re: hung backends stuck in spinlock heavy endless loop (Martijn van Oosterhout <kleptog@svana.org>)
  Re: hung backends stuck in spinlock heavy endless loop (Merlin Moncure <mmoncure@gmail.com>)
List: pgsql-hackers
On Fri, Jan 16, 2015 at 5:20 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Fri, Jan 16, 2015 at 10:33 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> ISTM the next step is to bisect the problem down over the weekend in
>> order to narrow the search.  If that doesn't turn up anything
>> productive I'll look into taking other steps.
>
> That might be the quickest way to do it, provided you can isolate the
> bug fairly reliably. It might be a bit tricky to write a shell script
> that assumes a certain amount of time having passed without the bug
> tripping indicates that it doesn't exist, and have that work
> consistently. I'm slightly concerned that you'll hit other bugs that
> have since been fixed, given the large number of possible symptoms
> here.

Quick update: not done yet, but I'm making consistent progress, with several false starts. (For example, I had a .conf problem with the new dynamic shared memory setting, and git merrily bisected down to the introduction of that feature.) I have to triple-check everything :(. The problem is generally reproducible, but I get false negatives that throw off the bisection. I estimate that early next week I'll have it narrowed down significantly, if not to the exact offending revision.

So far, the 'nasty' damage seems to generally if not always follow a checksum failure, and the checksum failures are always numerically adjacent.
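The classifier Peter describes, "a certain amount of time having passed without the bug tripping" meaning the revision is good, could be sketched roughly like this. Everything here is an assumption for illustration (the function name, the window length, the workload command, and the log path are not from the actual test setup):

```shell
# classify_revision: hedged sketch of a bisect classifier.  Runs a
# reproducing workload for a fixed window, then calls the revision "bad"
# if any page-verification warning reached the log.  A timeout is the
# expected outcome for a good revision: it just means the bug did not
# trip in time (which is exactly the false-negative risk discussed above).
classify_revision() {
    window_secs="$1"; log="$2"; shift 2
    # Run the caller-supplied workload, capped at the window length.
    timeout "$window_secs" "$@" >"$log" 2>&1 || true
    if grep -q "page verification failed" "$log"; then
        return 1    # bad: checksum failure reproduced within the window
    fi
    return 0        # good -- or a false negative if the window was too short
}
```

A script wrapping this function could then drive `git bisect run`, which treats exit 0 as good and exit 1 as bad.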
For example:

[cds2 12707 2015-01-22 12:51:11.032 CST 2754]WARNING: page verification failed, calculated checksum 9465 but expected 9477 at character 20
[cds2 21202 2015-01-22 13:10:18.172 CST 3196]WARNING: page verification failed, calculated checksum 61889 but expected 61903 at character 20
[cds2 29153 2015-01-22 14:49:04.831 CST 4803]WARNING: page verification failed, calculated checksum 27311 but expected 27316

I'm not up on the intricacies of our checksum algorithm, but this is making me suspicious that we are looking at an improperly flipped visibility bit via some obscure problem -- almost certainly with vacuum playing a role. This fits the profile of catastrophic damage that masquerades as numerous other problems. Or, perhaps, something is flipping what it thinks is a visibility bit but on the wrong page. I still haven't categorically ruled out pl/sh yet; that's something to keep in mind.

In the 'plus' category, aside from flushing out this issue, I've had zero runtime problems so far beyond the main problem; bisection (at least on the 'bad' side) has been reliably engaged by simply counting the number of warnings/errors/etc in the log. That's really impressive.

merlin
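The counting step mentioned above, flagging a revision as 'bad' by tallying verification warnings in the server log, can be reduced to a one-line helper. The function name is an assumption for illustration; only the warning text is taken from the log excerpts in this message:

```shell
# count_verification_failures: count checksum warnings in a server log,
# the tally used to decide the 'bad' side of the bisection.
count_verification_failures() {
    # grep -c still prints 0 when nothing matches but exits nonzero,
    # so swallow the exit status and keep only the count.
    grep -c "page verification failed" "$1" || true
}
```

For a clean log this prints 0; any nonzero count marks the revision as bad.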