Обсуждение: total db lockup
(NOTE: reposting this for the *fifth* time because my previous messages didn't go through). Hi all, We have experienced a really weird problem with postgresql yesterday. When I was called in to take a look, all the non-superuser connections were used up and they were all in a waiting state (SELECT waiting, UPDATE waiting, etc.). I couldn't figure out what the problem is, but I saved the ps, pg_stat_activity, and pg_locks state at the time (attached as condor_db_stats.txt). BTW, when looking at the queries in pg_stat_activity, postgresql cuts them off so you can't see the entire string. Is there any way to prevent that, or at least increase the character limit? We restarted postgresql server and the problem reappeared a few minutes later. I saved that state also (condor_db_stats2.txt). At that point, after we restarted postgresql (again), I ran VACUUM on the entire database and did a few more things that seem to have solved the problem (see below). This particular database is essentially just one flat table (level) with a few small supporting tables. Only the level table is heavily used. (table definition is attached as table.txt). I noticed that one of the indexes (level_owner_index) was a hash index. I remembered what postgresql manual says about hash indexes and concurrency (http://www.postgresql.org/docs/7.4/interactive/locking-indexes.html) and, after VACUUM finished, replaced the hash index with a btree. I then did a REINDEX of the level table and ANALYZE. This seems to have solved the problem -- at least as of this morning we still have not seen any deadlocks. My question is, what could have caused this to happen? Can anyone explain this paragraph from the manual: "Share/exclusive page-level locks are used for read/write access. Locks are released after the page is processed. Page-level locks provide better concurrency than index-level ones but are liable to deadlocks." Any other pointers to help me figure out what went wrong and how to fix it? thanks, Eugene WTF? My message doesn't appear. Trying again without attachments or inline text.
So can anyone offer any insight on this? BTW, I tried reposting it with attachments and it didn't show up. thanks, Eugene > > From: <eugene1@sympatico.ca> > Date: 2005/08/18 Thu AM 09:24:30 EST > To: <pgsql-general@postgresql.org> > Subject: [GENERAL] total db lockup > > (NOTE: reposting this for the *fifth* time because my previous messages didn't go through). > > > Hi all, > > We have experienced a really weird problem with > postgresql yesterday. When I was called in to take a > look, all the non-superuser connections were used up > and they were all in a waiting state (SELECT waiting, > UPDATE waiting, etc.). I couldn't figure out what the > problem is, but I saved the ps, pg_stat_activity, and > pg_locks state at the time (attached as > condor_db_stats.txt). BTW, when looking at the queries > in pg_stat_activity, postgresql cuts them off so you > can't see the entire string. Is there any way to > prevent that, or at least increase the character > limit? > > We restarted postgresql server and the problem > reappeared a few minutes later. I saved that state > also (condor_db_stats2.txt). At that point, after we > restarted postgresql (again), I ran VACUUM on the > entire database and did a few more things that seem to > have solved the problem (see below). > > This particular database is essentially just one flat > table (level) with a few small supporting tables. Only > the level table is heavily used. (table definition is > attached as table.txt). I noticed that one of the > indexes (level_owner_index) was a hash index. I > remembered what postgresql manual says about hash > indexes and concurrency > (http://www.postgresql.org/docs/7.4/interactive/locking-indexes.html) > and, after VACUUM finished, replaced the hash index > with a btree. I then did a REINDEX of the level table > and ANALYZE. This seems to have solved the problem -- > at least as of this morning we still have not seen any > deadlocks. > > My question is, what could have caused this to happen? > Can anyone explain this paragraph from the manual: > > "Share/exclusive page-level locks are used for > read/write access. Locks are released after the page > is processed. Page-level locks provide better > concurrency than index-level ones but are liable to > deadlocks." > > > Any other pointers to help me figure out what went > wrong and how to fix it? > > thanks, > > Eugene > > WTF? My message doesn't appear. Trying again without > attachments or inline text. > > > ---------------------------(end of broadcast)--------------------------- > TIP 2: Don't 'kill -9' the postmaster >
<eugene1@sympatico.ca> writes: > So can anyone offer any insight on this? Not without information. You haven't even told us what PG version you are running, much less provided the necessary details like the pg_locks status. > BTW, I tried reposting it with attachments and it didn't show up. Fix your mail setup and try again. Or wait a bit --- it's quite possible the previous messages are just stuck in the moderator's approval queue. regards, tom lane
It seems that I can't. For whatever reason, my messages are being blocked. Eugene > > From: Tom Lane <tgl@sss.pgh.pa.us> > Date: 2005/08/18 Thu PM 02:11:54 EST > To: eugene1@sympatico.ca > CC: pgsql-general@postgresql.org > Subject: Re: [GENERAL] total db lockup > >
Вложения
Guys, I really need help on this. Can whoever is in charge of the mailing list change it so that it doesn't delete my messages? thanks, Eugene > > From: Eugene <eugene1@sympatico.ca> > Date: 2005/08/18 Thu PM 04:45:58 EST > To: Tom Lane <tgl@sss.pgh.pa.us> > CC: <pgsql-general@postgresql.org> > Subject: Re: [GENERAL] total db lockup > > It seems that I can't. For whatever reason, my messages are being blocked. > > Eugene > > > > > > From: Tom Lane <tgl@sss.pgh.pa.us> > > Date: 2005/08/18 Thu PM 02:11:54 EST > > To: eugene1@sympatico.ca > > CC: pgsql-general@postgresql.org > > Subject: Re: [GENERAL] total db lockup > > > > > > ---------------------------(end of broadcast)--------------------------- TIP 2: Don't 'kill -9' the postmaster
Вложения
Well, perhaps they were blocked for being too large? Seriously, some of your messages appear to be getting through fine so if you can't attach them put them on a web or ftp server and post a link. Much better than copying it to several hundred mailboxes. On Thu, Aug 18, 2005 at 05:20:55PM -0400, Eugene wrote: > Guys, I really need help on this. Can whoever is in charge of the > mailing list change it so that it doesn't delete my messages? -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them.
Вложения
On Thu, Aug 18, 2005 at 05:20:55PM -0400, Eugene wrote: > Guys, I really need help on this. Can whoever is in charge of the > mailing list change it so that it doesn't delete my messages? I am not in charge of the mail server. However: May I suggest you change to a less broken mail client? Your messages show up featuring really strange MIME content description. Even the standard gmail.com accounts seem to work better, which is surprising for a web based email setup. This brokenness may hint the spam processor on postgresql.org servers to silently droo your emails. Alternatively, you may want to "paste" your error messages and stuff on places like http://rafb.net/paste, and then post the URLs here. -- Alvaro Herrera (<alvherre[a]alvh.no-ip.org>) "El sentido de las cosas no viene de las cosas, sino de las inteligencias que las aplican a sus problemas diarios en busca del progreso." (Ernesto Hernández-Novich)
Thanks Alvaro. Here it is again: problem description: http://rafb.net/paste/results/bLAtIk26.html db state before first restart: http://rafb.net/paste/results/D1Bqe125.html db state before second restart: http://rafb.net/paste/results/D1Bqe125.html table definition: http://rafb.net/paste/results/W35ccD49.html thanks, Eugene > > From: Alvaro Herrera <alvherre@alvh.no-ip.org> > Date: 2005/08/18 Thu PM 08:52:17 EST > To: Eugene <eugene1@sympatico.ca> > CC: pgsql-general@postgresql.org > Subject: Re: [GENERAL] total db lockup > > On Thu, Aug 18, 2005 at 05:20:55PM -0400, Eugene wrote: > > Guys, I really need help on this. Can whoever is in charge of the > > mailing list change it so that it doesn't delete my messages? > > I am not in charge of the mail server. However: May I suggest you > change to a less broken mail client? Your messages show up featuring > really strange MIME content description. Even the standard gmail.com > accounts seem to work better, which is surprising for a web based email > setup. This brokenness may hint the spam processor on postgresql.org > servers to silently droo your emails. > > Alternatively, you may want to "paste" your error messages and stuff on > places like http://rafb.net/paste, and then post the URLs here. > > -- > Alvaro Herrera (<alvherre[a]alvh.no-ip.org>) > "El sentido de las cosas no viene de las cosas, sino de > las inteligencias que las aplican a sus problemas diarios > en busca del progreso." (Ernesto Hernández-Novich) >
On Fri, Aug 19, 2005 at 10:54:35AM -0400, Eugene wrote: > Thanks Alvaro. > > Here it is again: > > problem description: http://rafb.net/paste/results/bLAtIk26.html > db state before first restart: http://rafb.net/paste/results/D1Bqe125.html > db state before second restart: http://rafb.net/paste/results/D1Bqe125.html > table definition: http://rafb.net/paste/results/W35ccD49.html Ok, so it seems the lockup occured only with the hash indexes? Then it means we still have bugs in the locking code for those. It doesn't surprise me. There's a reason they are not recommended, you know? Let us know if you find the problem showing up again with btree indexes. The hash indexes bugs should be fixed, but they are not high priority ... -- Alvaro Herrera (<alvherre[a]alvh.no-ip.org>) "If it wasn't for my companion, I believe I'd be having the time of my life" (John Dunbar)
actually, I'm using postgresql 7.3.2. I noticed there were some hash index fixes in 7.4, so that might be it. Would therebe any other reason for this deadlock? This database has been in production for almost 2 years and this is the firsttime we've seen the problem. thanks, Eugene > > From: Alvaro Herrera <alvherre@alvh.no-ip.org> > Date: 2005/08/19 Fri AM 11:22:04 EST > To: Eugene <eugene1@sympatico.ca> > CC: pgsql-general@postgresql.org > Subject: Re: [GENERAL] total db lockup > > On Fri, Aug 19, 2005 at 10:54:35AM -0400, Eugene wrote: > > Thanks Alvaro. > > > > Here it is again: > > > > problem description: http://rafb.net/paste/results/bLAtIk26.html > > db state before first restart: http://rafb.net/paste/results/D1Bqe125.html > > db state before second restart: http://rafb.net/paste/results/D1Bqe125.html > > table definition: http://rafb.net/paste/results/W35ccD49.html > > Ok, so it seems the lockup occured only with the hash indexes? Then it > means we still have bugs in the locking code for those. It doesn't > surprise me. There's a reason they are not recommended, you know? > > Let us know if you find the problem showing up again with btree indexes. > The hash indexes bugs should be fixed, but they are not high priority ... > > -- > Alvaro Herrera (<alvherre[a]alvh.no-ip.org>) > "If it wasn't for my companion, I believe I'd be having > the time of my life" (John Dunbar) >
Alvaro Herrera <alvherre@alvh.no-ip.org> writes: > Ok, so it seems the lockup occured only with the hash indexes? Then it > means we still have bugs in the locking code for those. "Still" meaning "in the version he's using", which he hasn't told us anywhere that I saw. (Internal evidence suggests it's 7.3 something) > It doesn't surprise me. A deadlock in hash indexes wouldn't be surprising in pre-7.4 code, since 7.4 was the first version that even pretended to be deadlock free in hash indexes. But it does seem pretty surprising that the deadlock checker didn't fire and boot somebody out of the deadlock. We haven't had any bugs reported in deadlock detection in a very long time (since 7.1 I think). This could mean there's still something wrong in there. regards, tom lane