Thread: Apparent deadlock 7.0.1

Apparent deadlock 7.0.1

From: Grim
Date:
Hi

I have noticed a deadlock happening on 7.0.1 on updates.

The backends just lock, and take up as much CPU as they can. I kill
the postmaster, and the backends stay alive, using CPU at the highest
rate possible. The operations aren't that expensive, just a single line
of update.

Anyone else seen this? Anyone dealing with this?
If not, I will start trying to get some debug information.

Also, I tried to create an index and ran into the following problem:

search=# select count(*) from search_word_te;
 count 
-------
 71864
(1 row)

search=# create index search_word_te_index on search_word_te (word,wordnum);
ERROR:  btree: index item size 3040 exceeds maximum 2717

What is this all about? It worked fine on 6.5.2.
                    ~Michael


Re: Apparent deadlock 7.0.1

From: Tom Lane
Date:
Grim <grim@ewtoo.org> writes:
> I have noticed a deadlock happening on 7.0.1 on updates.
> The backends just lock, and take up as much CPU as they can. I kill
> the postmaster, and the backends stay alive, using CPU at the highest
> rate possible. The operations aren't that expensive, just a single line
> of update.
> Anyone else seen this? Anyone dealing with this?

News to me.  What sort of hardware are you running on?  It sort of
sounds like the spinlock code not working as it should --- and since
spinlocks are done with platform-dependent assembler, it matters...

> search=# create index search_word_te_index on search_word_te (word,wordnum);
> ERROR:  btree: index item size 3040 exceeds maximum 2717
> What is this all about? It worked fine on 6.5.2

If you had the same data in 6.5.2 then you were living on borrowed time.
The btree code assumes it can fit at least three keys per page, and if
you have some keys > 1/3 page then sooner or later three of them will
need to be stored on the same page.  6.5.2 didn't complain in advance,
it just crashed hard when that situation came up.  7.0 prevents the
problem by not letting you store an oversized key to begin with.
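
A quick way to find the rows that trip this is to measure the key column
directly.  A rough check along these lines should turn them up (the
2700-byte cut-off is only an approximation of the 2717-byte limit, since
it ignores the second key column and per-tuple overhead):

search=# SELECT wordnum, octet_length(word) AS word_bytes
search-#   FROM search_word_te
search-#  WHERE octet_length(word) > 2700
search-#  ORDER BY word_bytes DESC;

Anything that shows up there is too wide to fit three-to-a-page in the
btree.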

(Hopefully all these tuple-size-related problems will go away in 7.1.)
        regards, tom lane


Re: Apparent deadlock 7.0.1

From: Michael Simms
Date:
Tom Lane wrote:
> 
> Grim <grim@ewtoo.org> writes:
> > I have noticed a deadlock happening on 7.0.1 on updates.
> > The backends just lock, and take up as much CPU as they can. I kill
> > the postmaster, and the backends stay alive, using CPU at the highest
> > rate possible. The operations aren't that expensive, just a single line
> > of update.
> > Anyone else seen this? Anyone dealing with this?
> 
> News to me.  What sort of hardware are you running on?  It sort of
> sounds like the spinlock code not working as it should --- and since
> spinlocks are done with platform-dependent assembler, it matters...

The hardware/software is:

Linux kernel 2.2.15 (SMP kernel)
Glibc  2.1.1
Dual Intel PIII/500

There are usually about 30 connections to the database at any one time.

> The btree code assumes it can fit at least three keys per page, and if
> you have some keys > 1/3 page then sooner or later three of them will
> need to be stored on the same page.  6.5.2 didn't complain in advance,
> it just crashed hard when that situation came up.  7.0 prevents the
> problem by not letting you store an oversized key to begin with.

Ahhh, it was the tuple size.  I thought it meant the number of records in
the index or something, since coincidentally that was the biggest table.

Deleted one row of 3K, and all works fine now, thanks!
                ~Michael


Re: Apparent deadlock 7.0.1

From: Tom Lane
Date:
Michael Simms <grim@ewtoo.org> writes:
>>>> I have noticed a deadlock happening on 7.0.1 on updates.
>>>> The backends just lock, and take up as much CPU as they can. I kill
>>>> the postmaster, and the backends stay alive, using CPU at the highest
>>>> rate possible. The operations aren't that expensive, just a single line
>>>> of update.
>>>> Anyone else seen this? Anyone dealing with this?
>> 
>> News to me.  What sort of hardware are you running on?  It sort of
>> sounds like the spinlock code not working as it should --- and since
>> spinlocks are done with platform-dependent assembler, it matters...

> The hardware/software is:

> Linux kernel 2.2.15 (SMP kernel)
> Glibc  2.1.1
> Dual Intel PIII/500

Dual CPUs huh?  I have heard of motherboards that have (misdesigned)
memory caching such that the two CPUs don't reliably see each others'
updates to a shared memory location.  Naturally that plays hell with the
spinlock code :-(.  It might be necessary to insert some kind of cache-
flushing instruction into the spinlock wait loop to ensure that the
CPUs see each others' changes to the lock.
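
For context, the spinlock under discussion is essentially a test-and-set
loop.  A minimal sketch of the idea (illustrative only, not the actual
s_lock() code) might look roughly like this on x86, where the implicitly
locked xchg also acts as the memory barrier between the two CPUs:

typedef volatile unsigned char slock_t;

/* Atomically set the lock byte and return its previous value.
 * On x86 an xchg with a memory operand is implicitly locked, so it
 * also serves as a full memory barrier. */
static int
tas(slock_t *lock)
{
    unsigned char res = 1;

    __asm__ __volatile__("xchgb %0, %1"
                         : "+q" (res), "+m" (*lock)
                         :
                         : "memory");
    return (int) res;           /* 0 means we acquired the lock */
}

static void
spin_acquire(slock_t *lock)
{
    while (tas(lock))
        ;                       /* real code backs off and eventually
                                   reports "stuck spinlock" */
}

static void
spin_release(slock_t *lock)
{
    *lock = 0;                  /* plain store suffices on x86 */
}

If the two CPUs don't agree on the contents of *lock, the waiter can spin
forever exactly as described above.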

This is all theory at this point, and a hole in the theory is that the
backends ought to give up with a "stuck spinlock" error after a minute
or two of not being able to grab the lock.  I assume you have let them
go at it for longer than that without seeing such an error?

Anyway, the next step is to "kill -ABRT" some of the stuck processes
and get backtraces from their core dumps to see where they are stuck.
If you find they are inside s_lock() then it's definitely some kind of
spinlock problem.  If not...
        regards, tom lane