Re: hung backends stuck in spinlock heavy endless loop

From: Merlin Moncure
Subject: Re: hung backends stuck in spinlock heavy endless loop
Date:
Msg-id: CAHyXU0x7MPmW1v1kqB5Trb_z0no5w5QpK7_qFo0CYvNngyYsbA@mail.gmail.com
In response to: Re: hung backends stuck in spinlock heavy endless loop  (Peter Geoghegan <pg@heroku.com>)
Responses: Re: hung backends stuck in spinlock heavy endless loop  (Jeff Janes <jeff.janes@gmail.com>)
Re: hung backends stuck in spinlock heavy endless loop  (Peter Geoghegan <pg@heroku.com>)
Re: hung backends stuck in spinlock heavy endless loop  (Martijn van Oosterhout <kleptog@svana.org>)
Re: hung backends stuck in spinlock heavy endless loop  (Merlin Moncure <mmoncure@gmail.com>)
List: pgsql-hackers
On Fri, Jan 16, 2015 at 5:20 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Fri, Jan 16, 2015 at 10:33 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> ISTM the next step is to bisect the problem down over the weekend in
>> order to narrow the search.  If that doesn't turn up anything
>> productive I'll look into taking other steps.
>
> That might be the quickest way to do it, provided you can isolate the
> bug fairly reliably. It might be a bit tricky to write a shell script
> that assumes a certain amount of time having passed without the bug
> tripping indicates that it doesn't exist, and have that work
> consistently. I'm slightly concerned that you'll hit other bugs that
> have since been fixed, given the large number of possible symptoms
> here.

Quick update:  not done yet, but I'm making consistent progress, with
several false starts (for example, I had a .conf problem with the new
dynamic shared memory setting, and git merrily bisected down to the
introduction of that feature).  I have to triple check everything :(.
The problem is generally reproducible, but I get false negatives that
throw off the bisection.  I estimate that early next week I'll have it
narrowed down significantly, if not to the exact offending revision.

So far, the 'nasty' damage seems to generally, if not always, follow a
checksum failure, and in each failure the calculated and expected
checksums are numerically adjacent.  For example:

[cds2 12707 2015-01-22 12:51:11.032 CST 2754]WARNING:  page
verification failed, calculated checksum 9465 but expected 9477 at
character 20
[cds2 21202 2015-01-22 13:10:18.172 CST 3196]WARNING:  page
verification failed, calculated checksum 61889 but expected 61903 at
character 20
[cds2 29153 2015-01-22 14:49:04.831 CST 4803]WARNING:  page
verification failed, calculated checksum 27311 but expected 27316

I'm not up on the intricacies of our checksum algorithm, but this is
making me suspicious that we are looking at an improperly flipped
visibility bit via some obscure problem -- almost certainly with
vacuum playing a role.  This fits the profile of catastrophic damage
that masquerades as numerous other problems.  Or, perhaps, something
is flipping what it thinks is a visibility bit but on the wrong page.
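
To put numbers on "numerically adjacent", here is a trivial sketch (in
Python, with the values copied from the warnings quoted above) of the
calculated-vs-expected deltas:

pairs = [(9465, 9477), (61889, 61903), (27311, 27316)]
for calculated, expected in pairs:
    print(f"calculated={calculated} expected={expected} "
          f"delta={expected - calculated}")
# prints deltas of 12, 14, and 5 -- consistently small, which is what
# feeds the flipped-bit suspicion above.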

I still haven't categorically ruled out pl/sh yet; that's something to
keep in mind.

In the 'plus' category, aside from flushing out this issue, I've had
zero runtime problems so far other than the main problem itself; the
'bad' side of the bisection has been reliably identified by simply
counting the number of warnings/errors/etc. in the log.  That's really
impressive.
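
For what it's worth, a "git bisect run" harness along the lines Peter
suggested upthread might look roughly like the Python sketch below: it
builds the revision, drives the workload for a fixed window, and
classifies the revision by counting the checksum warnings as described
above.  The log path, workload script, and one-hour window are
illustrative assumptions, not my actual setup (the workload script is
presumed to start the server and generate load):

#!/usr/bin/env python3
import subprocess
import sys
import time

LOGFILE = "/var/log/postgresql/cds2.log"   # assumed log location
WORKLOAD = ["./run_workload.sh"]           # assumed reproduction script
RUN_SECONDS = 3600                         # assumed observation window

def count_warnings(path):
    """Count checksum-failure warnings currently in the log."""
    try:
        with open(path, errors="replace") as f:
            return sum("page verification failed" in line for line in f)
    except FileNotFoundError:
        return 0

def main():
    # Exit 125 tells 'git bisect run' to skip revisions that don't build.
    for step in (["make", "-s", "-j4"], ["make", "-s", "install"]):
        if subprocess.call(step) != 0:
            sys.exit(125)

    before = count_warnings(LOGFILE)
    worker = subprocess.Popen(WORKLOAD)
    time.sleep(RUN_SECONDS)    # a quiet window only *suggests* "good";
    worker.terminate()         # this is where false negatives creep in
    worker.wait()

    # Any new warning means the bug reproduced: exit 1 marks the
    # revision bad, exit 0 marks it good.
    sys.exit(1 if count_warnings(LOGFILE) > before else 0)

if __name__ == "__main__":
    main()

Something like "git bisect start <bad> <good>" followed by "git bisect
run ./check_revision.py" would then drive the whole thing, with the
caveat that a clean run is only weak evidence of "good".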

merlin


