Обсуждение: total db lockup

Поиск
Список
Период
Сортировка

total db lockup

От
Дата:
(NOTE: reposting this for the *fifth* time because my previous messages didn't go through).


Hi all,

We have experienced a really weird problem with
postgresql yesterday. When I was called in to take a
look, all the non-superuser connections were used up
and they were all in a waiting state (SELECT waiting,
UPDATE waiting, etc.). I couldn't figure out what the
problem is, but I saved the ps, pg_stat_activity, and
pg_locks state at the time (attached as
condor_db_stats.txt). BTW, when looking at the queries
in pg_stat_activity, postgresql cuts them off so you
can't see the entire string. Is there any way to
prevent that, or at least increase the character
limit?

We restarted postgresql server and the problem
reappeared a few minutes later. I saved that state
also (condor_db_stats2.txt). At that point, after we
restarted postgresql (again), I ran VACUUM on the
entire database and did a few more things that seem to
have solved the problem (see below).

This particular database is essentially just one flat
table (level) with a few small supporting tables. Only
the level table is heavily used. (table definition is
attached as table.txt). I noticed that one of the
indexes (level_owner_index) was a hash index. I
remembered what postgresql manual says about hash
indexes and concurrency
(http://www.postgresql.org/docs/7.4/interactive/locking-indexes.html)
and, after VACUUM finished, replaced the hash index
with a btree. I then did a REINDEX of the level table
and ANALYZE. This seems to have solved the problem --
at least as of this morning we still have not seen any
deadlocks.

My question is, what could have caused this to happen?
Can anyone explain this paragraph from the manual:

"Share/exclusive page-level locks are used for
read/write access. Locks are released after the page
is processed. Page-level locks provide better
concurrency than index-level ones but are liable to
deadlocks."


Any other pointers to help me figure out what went
wrong and how to fix it?

thanks,

Eugene

WTF? My message doesn't appear. Trying again without
attachments or inline text.


Re: total db lockup

От
Дата:
So can anyone offer any insight on this? BTW, I tried reposting it with attachments and it didn't show up.

thanks,

Eugene


>
> From: <eugene1@sympatico.ca>
> Date: 2005/08/18 Thu AM 09:24:30 EST
> To: <pgsql-general@postgresql.org>
> Subject: [GENERAL] total db lockup
>
> (NOTE: reposting this for the *fifth* time because my previous messages didn't go through).
>
>
> Hi all,
>
> We have experienced a really weird problem with
> postgresql yesterday. When I was called in to take a
> look, all the non-superuser connections were used up
> and they were all in a waiting state (SELECT waiting,
> UPDATE waiting, etc.). I couldn't figure out what the
> problem is, but I saved the ps, pg_stat_activity, and
> pg_locks state at the time (attached as
> condor_db_stats.txt). BTW, when looking at the queries
> in pg_stat_activity, postgresql cuts them off so you
> can't see the entire string. Is there any way to
> prevent that, or at least increase the character
> limit?
>
> We restarted postgresql server and the problem
> reappeared a few minutes later. I saved that state
> also (condor_db_stats2.txt). At that point, after we
> restarted postgresql (again), I ran VACUUM on the
> entire database and did a few more things that seem to
> have solved the problem (see below).
>
> This particular database is essentially just one flat
> table (level) with a few small supporting tables. Only
> the level table is heavily used. (table definition is
> attached as table.txt). I noticed that one of the
> indexes (level_owner_index) was a hash index. I
> remembered what postgresql manual says about hash
> indexes and concurrency
> (http://www.postgresql.org/docs/7.4/interactive/locking-indexes.html)
> and, after VACUUM finished, replaced the hash index
> with a btree. I then did a REINDEX of the level table
> and ANALYZE. This seems to have solved the problem --
> at least as of this morning we still have not seen any
> deadlocks.
>
> My question is, what could have caused this to happen?
> Can anyone explain this paragraph from the manual:
>
> "Share/exclusive page-level locks are used for
> read/write access. Locks are released after the page
> is processed. Page-level locks provide better
> concurrency than index-level ones but are liable to
> deadlocks."
>
>
> Any other pointers to help me figure out what went
> wrong and how to fix it?
>
> thanks,
>
> Eugene
>
> WTF? My message doesn't appear. Trying again without
> attachments or inline text.
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: Don't 'kill -9' the postmaster
>


Re: total db lockup

От
Tom Lane
Дата:
<eugene1@sympatico.ca> writes:
> So can anyone offer any insight on this?

Not without information.  You haven't even told us what PG version you
are running, much less provided the necessary details like the pg_locks
status.

> BTW, I tried reposting it with attachments and it didn't show up.

Fix your mail setup and try again.  Or wait a bit --- it's quite
possible the previous messages are just stuck in the moderator's
approval queue.

            regards, tom lane

Re: total db lockup

От
Eugene
Дата:
It seems that I can't. For whatever reason, my messages are being blocked.

Eugene


>
> From: Tom Lane <tgl@sss.pgh.pa.us>
> Date: 2005/08/18 Thu PM 02:11:54 EST
> To: eugene1@sympatico.ca
> CC: pgsql-general@postgresql.org
> Subject: Re: [GENERAL] total db lockup
>
>

Вложения

Re: total db lockup

От
Eugene
Дата:
Guys, I really need help on this. Can whoever is in charge of the mailing list change it so that it doesn't delete my
messages?

thanks,

Eugene


>
> From: Eugene <eugene1@sympatico.ca>
> Date: 2005/08/18 Thu PM 04:45:58 EST
> To: Tom Lane <tgl@sss.pgh.pa.us>
> CC: <pgsql-general@postgresql.org>
> Subject: Re: [GENERAL] total db lockup
>
> It seems that I can't. For whatever reason, my messages are being blocked.
>
> Eugene
>
>
> >
> > From: Tom Lane <tgl@sss.pgh.pa.us>
> > Date: 2005/08/18 Thu PM 02:11:54 EST
> > To: eugene1@sympatico.ca
> > CC: pgsql-general@postgresql.org
> > Subject: Re: [GENERAL] total db lockup
> >
> >
>
>

---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster


Вложения

Re: total db lockup

От
Martijn van Oosterhout
Дата:
Well, perhaps they were blocked for being too large?

Seriously, some of your messages appear to be getting through fine so
if you can't attach them put them on a web or ftp server and post a
link. Much better than copying it to several hundred mailboxes.

On Thu, Aug 18, 2005 at 05:20:55PM -0400, Eugene wrote:
> Guys, I really need help on this. Can whoever is in charge of the
> mailing list change it so that it doesn't delete my messages?
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Вложения

Re: total db lockup

От
Alvaro Herrera
Дата:
On Thu, Aug 18, 2005 at 05:20:55PM -0400, Eugene wrote:
> Guys, I really need help on this. Can whoever is in charge of the
> mailing list change it so that it doesn't delete my messages?

I am not in charge of the mail server.  However: May I suggest you
change to a less broken mail client?  Your messages show up featuring
really strange MIME content description.  Even the standard gmail.com
accounts seem to work better, which is surprising for a web based email
setup.  This brokenness may hint the spam processor on postgresql.org
servers to silently droo your emails.

Alternatively, you may want to "paste" your error messages and stuff on
places like http://rafb.net/paste, and then post the URLs here.

--
Alvaro Herrera (<alvherre[a]alvh.no-ip.org>)
"El sentido de las cosas no viene de las cosas, sino de
las inteligencias que las aplican a sus problemas diarios
en busca del progreso." (Ernesto Hernández-Novich)

Re: total db lockup

От
Eugene
Дата:
Thanks Alvaro.

Here it is again:

problem description:            http://rafb.net/paste/results/bLAtIk26.html
db state before first restart:  http://rafb.net/paste/results/D1Bqe125.html
db state before second restart: http://rafb.net/paste/results/D1Bqe125.html
table definition:               http://rafb.net/paste/results/W35ccD49.html

thanks,

Eugene

>
> From: Alvaro Herrera <alvherre@alvh.no-ip.org>
> Date: 2005/08/18 Thu PM 08:52:17 EST
> To: Eugene <eugene1@sympatico.ca>
> CC: pgsql-general@postgresql.org
> Subject: Re: [GENERAL] total db lockup
>
> On Thu, Aug 18, 2005 at 05:20:55PM -0400, Eugene wrote:
> > Guys, I really need help on this. Can whoever is in charge of the
> > mailing list change it so that it doesn't delete my messages?
>
> I am not in charge of the mail server.  However: May I suggest you
> change to a less broken mail client?  Your messages show up featuring
> really strange MIME content description.  Even the standard gmail.com
> accounts seem to work better, which is surprising for a web based email
> setup.  This brokenness may hint the spam processor on postgresql.org
> servers to silently droo your emails.
>
> Alternatively, you may want to "paste" your error messages and stuff on
> places like http://rafb.net/paste, and then post the URLs here.
>
> --
> Alvaro Herrera (<alvherre[a]alvh.no-ip.org>)
> "El sentido de las cosas no viene de las cosas, sino de
> las inteligencias que las aplican a sus problemas diarios
> en busca del progreso." (Ernesto Hernández-Novich)
>


Re: total db lockup

От
Alvaro Herrera
Дата:
On Fri, Aug 19, 2005 at 10:54:35AM -0400, Eugene wrote:
> Thanks Alvaro.
>
> Here it is again:
>
> problem description:            http://rafb.net/paste/results/bLAtIk26.html
> db state before first restart:  http://rafb.net/paste/results/D1Bqe125.html
> db state before second restart: http://rafb.net/paste/results/D1Bqe125.html
> table definition:               http://rafb.net/paste/results/W35ccD49.html

Ok, so it seems the lockup occured only with the hash indexes?  Then it
means we still have bugs in the locking code for those.  It doesn't
surprise me.  There's a reason they are not recommended, you know?

Let us know if you find the problem showing up again with btree indexes.
The hash indexes bugs should be fixed, but they are not high priority ...

--
Alvaro Herrera (<alvherre[a]alvh.no-ip.org>)
"If it wasn't for my companion, I believe I'd be having
the time of my life"  (John Dunbar)

Re: total db lockup

От
Eugene
Дата:
actually, I'm using postgresql 7.3.2. I noticed there were some hash index fixes in 7.4, so that might be it. Would
therebe any other reason for this deadlock? This database has been in production for almost 2 years and this is the
firsttime we've seen the problem. 

thanks,

Eugene


>
> From: Alvaro Herrera <alvherre@alvh.no-ip.org>
> Date: 2005/08/19 Fri AM 11:22:04 EST
> To: Eugene <eugene1@sympatico.ca>
> CC: pgsql-general@postgresql.org
> Subject: Re: [GENERAL] total db lockup
>
> On Fri, Aug 19, 2005 at 10:54:35AM -0400, Eugene wrote:
> > Thanks Alvaro.
> >
> > Here it is again:
> >
> > problem description:            http://rafb.net/paste/results/bLAtIk26.html
> > db state before first restart:  http://rafb.net/paste/results/D1Bqe125.html
> > db state before second restart: http://rafb.net/paste/results/D1Bqe125.html
> > table definition:               http://rafb.net/paste/results/W35ccD49.html
>
> Ok, so it seems the lockup occured only with the hash indexes?  Then it
> means we still have bugs in the locking code for those.  It doesn't
> surprise me.  There's a reason they are not recommended, you know?
>
> Let us know if you find the problem showing up again with btree indexes.
> The hash indexes bugs should be fixed, but they are not high priority ...
>
> --
> Alvaro Herrera (<alvherre[a]alvh.no-ip.org>)
> "If it wasn't for my companion, I believe I'd be having
> the time of my life"  (John Dunbar)
>


Re: total db lockup

От
Tom Lane
Дата:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> Ok, so it seems the lockup occured only with the hash indexes?  Then it
> means we still have bugs in the locking code for those.

"Still" meaning "in the version he's using", which he hasn't told us
anywhere that I saw.  (Internal evidence suggests it's 7.3 something)

> It doesn't surprise me.

A deadlock in hash indexes wouldn't be surprising in pre-7.4 code,
since 7.4 was the first version that even pretended to be deadlock free
in hash indexes.  But it does seem pretty surprising that the deadlock
checker didn't fire and boot somebody out of the deadlock.  We haven't
had any bugs reported in deadlock detection in a very long time (since
7.1 I think).  This could mean there's still something wrong in there.

            regards, tom lane