Discussion: postmaster locks up in 7.1b3

postmaster locks up in 7.1b3

From
pgsql-bugs@postgresql.org
Date:
paul vixie (paul@vix.com) reports a bug with a severity of 2
The lower the number the more severe it is.

Short Description
postmaster locks up in 7.1b3

Long Description
this morning at 1:00AM our nightly "vacuum analyze;" ran from cron
and immediately went idle.  both the psql process and the resulting
child of the postmaster were using no CPU time.  all other subsequent
accessors whether psql or DBI also hung.  i was not able to determine
whether they were locking up on opening the session or on the first
command.  by the time i came on the scene there were dozens of hung
children of the postmaster and also dozens of hung psql/DBI processes.

what fixed it was killing off a bunch of remote psql and DBI clients.
nothing was killed on the postmaster host.  the result was that all
hung psql/DBI processes completed normally, all hung children of the
postmaster seemed to complete normally, and the "vacuum analyze"
started actually chewing up CPU and I/O, completing normally about
five minutes later (which is the usual total run time.)

this is on a dual-CPU freebsd-4.1-release host, in case serialization
of access to the shared memory (if any) between the postmaster and
its various children is an issue.  what it felt like was a deadlock
that was broken when the remote psql/DBI clients were killed -- this
would have resulted in a select() wakeup on at least readfds and
exceptfds and perhaps writefds as well.
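
The select() wakeup described above can be seen in isolation with a minimal sketch. It uses Python's socket.socketpair() as a stand-in for one client connection; the names server_side and client_side are hypothetical, and this is not postmaster code. It only checks the read set: whether exceptfds also fires when a peer goes away varies by platform and is not tested here.

```python
import select
import socket

# A connected pair standing in for one backend <-> client connection.
server_side, client_side = socket.socketpair()

# While the client is idle, select() reports nothing readable (zero timeout).
idle, _, _ = select.select([server_side], [], [], 0)

# Killing the client closes its end of the connection.
client_side.close()

# The server end now becomes readable: select() wakes up with it in the read set.
woken, _, _ = select.select([server_side], [], [], 1.0)

# Reading at that point returns an empty byte string, i.e. end-of-file.
data = server_side.recv(1)

server_side.close()
```

A process blocked in select() on that descriptor would therefore be woken as soon as the remote client is killed, which fits the observation that everything resumed at that moment.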

i am upgrading to 7.1.2 on the postmaster, with a full pg_dumpall and
restore, to rule out "old bugs" (possible?) and on-disk corruption
(possible, too, i guess?) and if it reoccurs i will get stack traces
and fstat's and whatnot.  so this is really just a heads-up for now.

Sample Code


No file was uploaded with this report

Re: postmaster locks up in 7.1b3

From
Tom Lane
Date:
pgsql-bugs@postgresql.org writes:
> what fixed it was killing off a bunch of remote psql and DBI clients.
> nothing was killed on the postmaster host.  the result was that all
> hung psql/DBI processes completed normally, all hung children of the
> postmaster seemed to complete normally, and the "vacuum analyze"
> started actually chewing up CPU and I/O, completing normally about
> five minutes later (which is the usual total run time.)

My guess is that you had a client that was holding a lock on some
table that's used by most of your clients.  All this would take is
not closing an open transaction after reading/writing the table.
Then vacuum comes along and wants an exclusive lock on that table,
so it sits and waits.  Then everyone else comes along and wants to
read or write that same table.  Normally, their requests would not
conflict with the read or write lock held by the original client
... but they do conflict with vacuum's exclusive-lock request, so
they stack up behind the vacuum.
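
The queueing effect described above can be sketched with a toy model. This is a simplified illustration under stated assumptions, not PostgreSQL's lock manager: it has only two lock modes (shared and exclusive), a strict first-come-first-served wait queue, and hypothetical names such as TableLock, stuck_client and reader_1.

```python
from collections import deque

SHARED, EXCLUSIVE = "shared", "exclusive"


def conflicts(a, b):
    """Two lock modes conflict if either one is exclusive."""
    return a == EXCLUSIVE or b == EXCLUSIVE


class TableLock:
    """Toy lock on one table with a FIFO wait queue."""

    def __init__(self):
        self.holders = {}        # session name -> mode currently held
        self.waiters = deque()   # (session name, mode) in arrival order

    def request(self, name, mode):
        """Grant the lock at once, or queue the request. Returns True if granted."""
        blocked_by_holder = any(conflicts(mode, m) for m in self.holders.values())
        # A new request also waits behind any conflicting request already queued.
        blocked_by_waiter = any(conflicts(mode, m) for _, m in self.waiters)
        if blocked_by_holder or blocked_by_waiter:
            self.waiters.append((name, mode))
            return False
        self.holders[name] = mode
        return True

    def release(self, name):
        """Drop a held lock, then grant queued requests in arrival order."""
        del self.holders[name]
        while self.waiters:
            waiter, mode = self.waiters[0]
            if any(conflicts(mode, m) for m in self.holders.values()):
                break
            self.waiters.popleft()
            self.holders[waiter] = mode


lock = TableLock()
got_client = lock.request("stuck_client", SHARED)   # open transaction, never closed
got_vacuum = lock.request("vacuum", EXCLUSIVE)      # waits for stuck_client
got_reader_1 = lock.request("reader_1", SHARED)     # queues behind vacuum
got_reader_2 = lock.request("reader_2", SHARED)     # queues behind vacuum

lock.release("stuck_client")                        # the stuck client goes away
after_client_release = sorted(lock.holders)         # only vacuum runs now

lock.release("vacuum")                              # vacuum finishes
after_vacuum_release = sorted(lock.holders)         # both readers proceed
```

In this model the two readers would have been granted immediately if vacuum had not been queued ahead of them; they wait only because their requests conflict with the exclusive request already in the queue.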

As far as Postgres is concerned, there's no deadlock here, only a
slow client.  But it's a fairly annoying scenario anyway, since a
client that's hung on some external condition can block everyone else
indirectly through the background VACUUM.

7.2 will use non-exclusive locks for vacuuming (by default, if I
get my way about it), which should make this sort of problem much
less frequent.

> i am upgrading to 7.1.2 on the postmaster,

Good idea --- 7.1b3 had a number of nasty bugs.  But I doubt this
is one of them.

            regards, tom lane

Re: postmaster locks up in 7.1b3

From
Tom Lane
Date:
Paul A Vixie <vixie@vix.com> writes:
>> As far as Postgres is concerned, there's no deadlock here, only a slow client

> that could be true if we used explicit locks.  all our accesses are of the
> form "learn everything you need to know to do the transaction, then open the
> database, do it, and close".  there are some really long SELECT's (which make
> dns zone files) but they can't block unless the file system is blocking the
> write()'s in the client, which would only happen in NFS, which we don't use.

Well, my point was that it could happen just on the basis of the
*implicit* read lock grabbed by a SELECT.  All you'd need is a client
that's stuck partway through a transaction for some external reason.
However, it sounds like you've taken care to avoid that possibility,
so the theory does seem shaky.

            regards, tom lane

Re: postmaster locks up in 7.1b3

From
Paul A Vixie
Date:
> As far as Postgres is concerned, there's no deadlock here, only a slow client

that could be true if we used explicit locks.  all our accesses are of the
form "learn everything you need to know to do the transaction, then open the
database, do it, and close".  there are some really long SELECT's (which make
dns zone files) but they can't block unless the file system is blocking the
write()'s in the client, which would only happen in NFS, which we don't use.

your scenario is not implausible, however, and i will watch for it if it
happens again after i upgrade.  i didn't mean to waste any of you guys' time
at this point, i just wanted to let you know about this in case it was another
data point in a problem you were tracking elsewhere, or in case i'm able to
track it more closely later.

thanks for your quick reply.