Discussion: Apparent deadlock 7.0.1
Hi,

I have noticed a deadlock happening on 7.0.1 on updates. The backends just
lock, and take up as much CPU as they can. I kill the postmaster, and the
backends stay alive, using CPU at the highest rate possible. The operations
aren't that expensive, just a single line of update.

Anyone else seen this? Anyone dealing with this? If not, I will start to
try and get some debug information.

Also, I tried to make an index and had the following problem:

search=# select count(*) from search_word_te;
 count
-------
 71864
(1 row)

search=# create index search_word_te_index on search_word_te (word, wordnum);
ERROR:  btree: index item size 3040 exceeds maximum 2717

What is this all about? It worked fine on 6.5.2.

~Michael
Grim <grim@ewtoo.org> writes:
> I have noticed a deadlock happening on 7.0.1 on updates.
> The backends just lock, and take up as much CPU as they can. I kill
> the postmaster, and the backends stay alive, using CPU at the highest
> rate possible. The operations aren't that expensive, just a single line
> of update.
> Anyone else seen this? Anyone dealing with this?

News to me. What sort of hardware are you running on? It sort of
sounds like the spinlock code not working as it should --- and since
spinlocks are done with platform-dependent assembler, it matters...

> search=# create index search_word_te_index on search_word_te (word, wordnum);
> ERROR:  btree: index item size 3040 exceeds maximum 2717
> What is this all about? It worked fine on 6.5.2.

If you had the same data in 6.5.2 then you were living on borrowed time.
The btree code assumes it can fit at least three keys per page, and if
you have some keys > 1/3 page then sooner or later three of them will
need to be stored on the same page. 6.5.2 didn't complain in advance,
it just crashed hard when that situation came up. 7.0 prevents the
problem by not letting you store an oversized key to begin with.
(Hopefully all these tuple-size-related problems will go away in 7.1.)

			regards, tom lane
Tom Lane wrote:
> Grim <grim@ewtoo.org> writes:
> > I have noticed a deadlock happening on 7.0.1 on updates.
> > The backends just lock, and take up as much CPU as they can. I kill
> > the postmaster, and the backends stay alive, using CPU at the highest
> > rate possible. The operations aren't that expensive, just a single line
> > of update.
> > Anyone else seen this? Anyone dealing with this?
>
> News to me. What sort of hardware are you running on? It sort of
> sounds like the spinlock code not working as it should --- and since
> spinlocks are done with platform-dependent assembler, it matters...

The hardware/software is:

  Linux kernel 2.2.15 (SMP kernel)
  Glibc 2.1.1
  Dual Intel PIII/500

There are usually about 30 connections to the database at any one time.

> The btree code assumes it can fit at least three keys per page, and if
> you have some keys > 1/3 page then sooner or later three of them will
> need to be stored on the same page. 6.5.2 didn't complain in advance,
> it just crashed hard when that situation came up. 7.0 prevents the
> problem by not letting you store an oversized key to begin with.

Ahhh, it was the tuple size. I thought it meant the number of records in
the index or something, since coincidentally that was the biggest table.
Deleted one row of 3K, and all works fine now, thanks!

~Michael
Michael Simms <grim@ewtoo.org> writes:
>>>> I have noticed a deadlock happening on 7.0.1 on updates.
>>>> The backends just lock, and take up as much CPU as they can. I kill
>>>> the postmaster, and the backends stay alive, using CPU at the highest
>>>> rate possible. The operations aren't that expensive, just a single line
>>>> of update.
>>>> Anyone else seen this? Anyone dealing with this?
>>
>> News to me. What sort of hardware are you running on? It sort of
>> sounds like the spinlock code not working as it should --- and since
>> spinlocks are done with platform-dependent assembler, it matters...

> The hardware/software is:
> Linux kernel 2.2.15 (SMP kernel)
> Glibc 2.1.1
> Dual Intel PIII/500

Dual CPUs, huh? I have heard of motherboards that have (misdesigned)
memory caching such that the two CPUs don't reliably see each others'
updates to a shared memory location. Naturally that plays hell with
the spinlock code :-(. It might be necessary to insert some kind of
cache-flushing instruction into the spinlock wait loop to ensure that
the CPUs see each others' changes to the lock.

This is all theory at this point, and a hole in the theory is that the
backends ought to give up with a "stuck spinlock" error after a minute
or two of not being able to grab the lock. I assume you have let them
go at it for longer than that without seeing such an error?

Anyway, the next step is to "kill -ABRT" some of the stuck processes
and get backtraces from their coredumps to see where they are stuck.
If you find they are inside s_lock() then it's definitely some kind of
spinlock problem. If not...

			regards, tom lane