Re: hung backends stuck in spinlock heavy endless loop

From: Merlin Moncure
Subject: Re: hung backends stuck in spinlock heavy endless loop
Date:
Msg-id: CAHyXU0x7MPmW1v1kqB5Trb_z0no5w5QpK7_qFo0CYvNngyYsbA@mail.gmail.com
In response to: Re: hung backends stuck in spinlock heavy endless loop  (Peter Geoghegan <pg@heroku.com>)
Responses: Re: hung backends stuck in spinlock heavy endless loop  (Jeff Janes <jeff.janes@gmail.com>)
Re: hung backends stuck in spinlock heavy endless loop  (Peter Geoghegan <pg@heroku.com>)
Re: hung backends stuck in spinlock heavy endless loop  (Martijn van Oosterhout <kleptog@svana.org>)
Re: hung backends stuck in spinlock heavy endless loop  (Merlin Moncure <mmoncure@gmail.com>)
List: pgsql-hackers
On Fri, Jan 16, 2015 at 5:20 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Fri, Jan 16, 2015 at 10:33 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> ISTM the next step is to bisect the problem down over the weekend in
>> order to narrow the search.  If that doesn't turn up anything
>> productive I'll look into taking other steps.
>
> That might be the quickest way to do it, provided you can isolate the
> bug fairly reliably. It might be a bit tricky to write a shell script
> that assumes a certain amount of time having passed without the bug
> tripping indicates that it doesn't exist, and have that work
> consistently. I'm slightly concerned that you'll hit other bugs that
> have since been fixed, given the large number of possible symptoms
> here.

Quick update:  not done yet, but I'm making consistent progress, with
several false starts (for example, I had a .conf problem with the new
dynamic shared memory setting, and git merrily bisected down to the
introduction of that feature).  I have to triple check everything :(.
The problem is generally reproducible, but I get false negatives that
throw off the bisection.  I estimate that early next week I'll have it
narrowed down significantly, if not to the exact offending revision.

So far, the 'nasty' damage seems to generally, if not always, follow a
checksum failure, and in each failure the calculated and expected
checksums are numerically adjacent.  For example:

[cds2 12707 2015-01-22 12:51:11.032 CST 2754]WARNING:  page
verification failed, calculated checksum 9465 but expected 9477 at
character 20
[cds2 21202 2015-01-22 13:10:18.172 CST 3196]WARNING:  page
verification failed, calculated checksum 61889 but expected 61903 at
character 20
[cds2 29153 2015-01-22 14:49:04.831 CST 4803]WARNING:  page
verification failed, calculated checksum 27311 but expected 27316

I'm not up on the intricacies of our checksum algorithm, but this is
making me suspicious that we are looking at an improperly flipped
visibility bit via some obscure problem -- almost certainly with
vacuum playing a role.  This fits the profile of catastrophic damage
that masquerades as numerous other problems.  Or, perhaps, something
is flipping what it thinks is a visibility bit but on the wrong page.
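
To put numbers on "numerically adjacent", here is a trivial sketch (in
Python, with the values copied from the warnings quoted above) of the
calculated-vs-expected deltas:

pairs = [(9465, 9477), (61889, 61903), (27311, 27316)]
for calculated, expected in pairs:
    print(f"calculated={calculated} expected={expected} "
          f"delta={expected - calculated}")
# prints deltas of 12, 14, and 5 -- consistently small, which is what
# feeds the flipped-bit suspicion above.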

I still haven't categorically ruled out pl/sh yet; that's something to
keep in mind.

In the 'plus' category, aside from flushing out this issue, I've had
zero runtime problems so far other than the main problem itself; the
'bad' side of the bisection has been reliably identified by simply
counting the number of warnings/errors/etc. in the log.  That's really
impressive.
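
For what it's worth, a "git bisect run" harness along the lines Peter
suggested upthread might look roughly like the Python sketch below: it
builds the revision, drives the workload for a fixed window, and
classifies the revision by counting the checksum warnings as described
above.  The log path, workload script, and one-hour window are
illustrative assumptions, not my actual setup (the workload script is
presumed to start the server and generate load):

#!/usr/bin/env python3
import subprocess
import sys
import time

LOGFILE = "/var/log/postgresql/cds2.log"   # assumed log location
WORKLOAD = ["./run_workload.sh"]           # assumed reproduction script
RUN_SECONDS = 3600                         # assumed observation window

def count_warnings(path):
    """Count checksum-failure warnings currently in the log."""
    try:
        with open(path, errors="replace") as f:
            return sum("page verification failed" in line for line in f)
    except FileNotFoundError:
        return 0

def main():
    # Exit 125 tells 'git bisect run' to skip revisions that don't build.
    for step in (["make", "-s", "-j4"], ["make", "-s", "install"]):
        if subprocess.call(step) != 0:
            sys.exit(125)

    before = count_warnings(LOGFILE)
    worker = subprocess.Popen(WORKLOAD)
    time.sleep(RUN_SECONDS)    # a quiet window only *suggests* "good";
    worker.terminate()         # this is where false negatives creep in
    worker.wait()

    # Any new warning means the bug reproduced: exit 1 marks the
    # revision bad, exit 0 marks it good.
    sys.exit(1 if count_warnings(LOGFILE) > before else 0)

if __name__ == "__main__":
    main()

Something like "git bisect start <bad> <good>" followed by "git bisect
run ./check_revision.py" would then drive the whole thing, with the
caveat that a clean run is only weak evidence of "good".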

merlin


