Re: hung backends stuck in spinlock heavy endless loop

Поиск
Список
Период
Сортировка
От Merlin Moncure
Тема Re: hung backends stuck in spinlock heavy endless loop
Дата
Msg-id CAHyXU0zj5wedt0eXo3jMjKshQV-mUhLdcOEuOaD9caZ4OucPFA@mail.gmail.com
обсуждение исходный текст
Ответ на Re: hung backends stuck in spinlock heavy endless loop  (Andres Freund <andres@2ndquadrant.com>)
Ответы Re: hung backends stuck in spinlock heavy endless loop  (Merlin Moncure <mmoncure@gmail.com>)
Список pgsql-hackers
On Thu, Jan 15, 2015 at 1:15 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> Hi,
>
>> The plot thickens!  I looped the test, still stock 9.4 as of this time
>> and went to lunch. When I came back, the database was in recovery
>> mode.  Here is the rough sequence of events.
>>
>
> Whoa. That looks scary. Did you see (some of) those errors before? Most
> of them should have been emitted independently of being built with
> assertions.

Nope.

>> 1) REINDEXED pg_class (wanted clean slate for full reproduction)
>> 2) before too long (unfortunately did not configure timestamps in the
>> log) starting seeing:
>>
>> ERROR:  root page 3 of index "pg_class_oid_index" has level 0,
>> expected 1 at character 8
>> QUERY:  UPDATE CDSRunTable SET
>>           Finished = clock_timestamp(),
>>           DidSucceed = _DidSucceed,
>>           ErrorMessage = _ErrorMessage
>>         WHERE CDSRunTableId = _CdsRunTableId
>> CONTEXT:  PL/pgSQL function cdsreconcileruntable(bigint) line 82 at
>> SQL statement
>> STATEMENT:  SELECT CDSReconcileRunTable(2020)
>>
>> ..and again with a FATAL
>> FATAL:  root page 3 of index "pg_class_oid_index" has level 0, expected 1
>> ERROR:  root page 3 of index "pg_class_oid_index" has level 0, expected 1
>> CONTEXT:  SQL statement "UPDATE CDSRunTable SET
>>           Finished = clock_timestamp(),
>>           DidSucceed = _DidSucceed,
>>           ErrorMessage = _ErrorMessage
>>         WHERE CDSRunTableId = _CdsRunTableId"
>>     PL/pgSQL function cdsreconcileruntable(bigint) line 82 at SQL statement
>
> Just to be absolutely clear, those happened *before* the following
> errors? And there were no 'during exception cleanup' like errors before?

correct.  unfortunately, I wasn't logging times so I'm not 100% sure
if the 'root page' error happened during the event or sometime before
it.  I'm addressing that with the log prefix.

> Were there any 'LOG: Handling deadlock detected on CdsRunTableId' log
> entries before? It's hard to know from here, but the 'during exception
> cleanup' indicates a problem in abort handling.  Were there any deadlock
> detected errors closeby?
>
> You're catching deadlock errors in a subtransaction. Hm.

yup, exactly 1:
WARNING:  LOG: Handling deadlock detected on CdsRunTableId 1626

Based on the id that is proven to be before the crashing started,
although at least one manual server restart happened in the interim.

Since it's possible the database is a loss, do you see any value in
bootstrappinng it again with checksums turned on?  One point of note
is that this is a brand spanking new SSD, maybe we nee to rule out
hardware based corruption?

> lternatively were there any 'LOG: CdsRunTableId %s Failed' messages? If
so, what was the cause?

There was one, but unfortunately since the error was trapped the
relevant detail to catch the other side of the deadlock is missed.
Barring this analysis, I'm inclined to remove that guard: it was
mostly there to deal with what turned out to be some bugs in the code
(specifically nonconstrained update on CDSTable).

> * Do you also use lock_timeout/statement_timeout?
not on this database -- this is a coding environment and not set up
with standard configuration

>  were any processes killed at that time?
later on, yes, by manual restart, and the query was cancelled first: to wit:

WARNING:  LOG: Handling deadlock detected on CdsRunTableId 1626
ERROR:  canceling statement due to user request
STATEMENT:  SELECT CDSReconcileRunTable(1626)

followed by restart -m fast

merlin



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Bruce Momjian
Дата:
Сообщение: Re: [BUGS] BUG #12070: hstore extension: hstore_to_json_loose produces invalid JSON
Следующее
От: Robert Haas
Дата:
Сообщение: Re: advance local xmin more aggressively