Thread: postmaster locks up in 7.1b3
paul vixie (paul@vix.com) reports a bug with a severity of 2
(the lower the number, the more severe it is).

Short Description
postmaster locks up in 7.1b3

Long Description
this morning at 1:00AM our nightly "vacuum analyze;" ran from cron and
immediately went idle. both the psql process and the resulting child of
the postmaster were using no CPU time. all other subsequent accessors,
whether psql or DBI, also hung. i was not able to determine whether they
were locking up on opening the session or on the first command. by the
time i came on the scene there were dozens of hung children of the
postmaster and also dozens of hung psql/DBI processes.

what fixed it was killing off a bunch of remote psql and DBI clients.
nothing was killed on the postmaster host. the result was that all hung
psql/DBI processes completed normally, all hung children of the
postmaster seemed to complete normally, and the "vacuum analyze" started
actually chewing up CPU and I/O, completing normally about five minutes
later (which is the usual total run time).

this is on a dual-CPU freebsd-4.1-release host, in case serialization of
access to the shared memory (if any) between the postmaster and its
various children is an issue. what it felt like was a deadlock that was
broken when the remote psql/DBI clients were killed -- this would have
resulted in a select() wakeup on at least readfds and exceptfds, and
perhaps writefds as well.

i am upgrading to 7.1.2 on the postmaster, with a full pg_dumpall and
restore, to rule out "old bugs" (possible?) and on-disk corruption
(possible, too, i guess?), and if it reoccurs i will get stack traces
and fstat's and whatnot. so this is really just a heads-up for now.

Sample Code
No file was uploaded with this report
pgsql-bugs@postgresql.org writes:
> what fixed it was killing off a bunch of remote psql and DBI clients.
> nothing was killed on the postmaster host. the result was that all
> hung psql/DBI processes completed normally, all hung children of the
> postmaster seemed to complete normally, and the "vacuum analyze"
> started actually chewing up CPU and I/O, completing normally about
> five minutes later (which is the usual total run time.)

My guess is that you had a client that was holding a lock on some table
that's used by most of your clients. All this would take is not closing
an open transaction after reading/writing the table. Then vacuum comes
along and wants an exclusive lock on that table, so it sits and waits.
Then everyone else comes along and wants to read or write that same
table. Normally, their requests would not conflict with the read or
write lock held by the original client ... but they do conflict with
vacuum's exclusive-lock request, so they stack up behind the vacuum.

As far as Postgres is concerned, there's no deadlock here, only a slow
client. But it's a fairly annoying scenario anyway, since a client
that's hung on some external condition can block everyone else
indirectly through the background VACUUM.

7.2 will use non-exclusive locks for vacuuming (by default, if I get my
way about it), which should make this sort of problem much less
frequent.

> i am upgrading to 7.1.2 on the postmaster,

Good idea --- 7.1b3 had a number of nasty bugs. But I doubt this is one
of them.

			regards, tom lane
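[Editor's note: the queueing behavior Tom describes can be sketched with a toy FIFO lock queue. This is an illustrative model, not PostgreSQL's actual lock manager; all names ("stuck client", "vacuum") are hypothetical. The key rule is that a new request must not conflict with any current holder *or* any earlier waiter, so the exclusive request cannot be starved -- which is exactly why readers stack up behind the waiting vacuum even though they would not conflict with the lock actually held.]

```python
# Toy model of a FIFO table-lock queue (not PostgreSQL's real lock
# manager). Readers share the lock; an exclusive request queues behind
# them, and later readers queue behind the exclusive request.
from collections import deque

SHARED, EXCLUSIVE = "shared", "exclusive"

def conflicts(a, b):
    """Two lock modes conflict unless both are shared."""
    return a == EXCLUSIVE or b == EXCLUSIVE

class TableLock:
    def __init__(self):
        self.holders = []       # (client, mode) currently granted
        self.waiters = deque()  # (client, mode) waiting, FIFO order

    def request(self, client, mode):
        # Grant only if no conflict with any holder or any earlier
        # waiter; otherwise an exclusive request could starve.
        blocked = (any(conflicts(mode, m) for _, m in self.holders) or
                   any(conflicts(mode, m) for _, m in self.waiters))
        if blocked:
            self.waiters.append((client, mode))
            return False        # caller must wait
        self.holders.append((client, mode))
        return True             # granted immediately

    def release(self, client):
        self.holders = [(c, m) for c, m in self.holders if c != client]
        # Wake waiters in FIFO order until one still conflicts.
        while self.waiters:
            _, mode = self.waiters[0]
            if any(conflicts(mode, m) for _, m in self.holders):
                break
            self.holders.append(self.waiters.popleft())

lock = TableLock()
lock.request("stuck client", SHARED)     # open transaction holds a read lock
lock.request("vacuum", EXCLUSIVE)        # waits behind the read lock
print(lock.request("new reader", SHARED))  # False: queued behind vacuum,
                                           # though it wouldn't conflict
                                           # with the actual holder
lock.release("stuck client")             # kill the stuck client...
print([c for c, _ in lock.holders])      # ['vacuum'] -- the pile-up drains
```

Once the stuck client's lock is released, vacuum runs and everything behind it completes in order -- matching the reported symptom that killing remote clients let all the hung processes finish normally.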
Paul A Vixie <vixie@vix.com> writes:
>> As far as Postgres is concerned, there's no deadlock here, only a
>> slow client

> that could be true if we used explicit locks. all our accesses are of
> the form "learn everything you need to know to do the transaction,
> then open the database, do it, and close". there are some really long
> SELECT's (which make dns zone files) but they can't block unless the
> file system is blocking the write()'s in the client, which would only
> happen in NFS, which we don't use.

Well, my point was that it could happen just on the basis of the
*implicit* read lock grabbed by a SELECT. All you'd need is a client
that's stuck partway through a transaction for some external reason.
However, it sounds like you've taken care to avoid that possibility, so
the theory does seem shaky.

			regards, tom lane
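[Editor's note: the implicit-lock point, and why the 7.2 change Tom mentions helps, can be made concrete with an excerpt of PostgreSQL's table-lock conflict matrix. The mode names below are the modern documented ones, which postdate 7.1; this is a sketch for illustration, not server code. A plain SELECT takes ACCESS SHARE, which conflicts only with ACCESS EXCLUSIVE -- the mode an old-style full vacuum takes -- whereas the lazy vacuum added in 7.2 takes SHARE UPDATE EXCLUSIVE, which does not conflict with readers at all.]

```python
# Excerpt of PostgreSQL's table-lock conflict matrix (modern mode
# names; illustrative sketch, not server code). Each mode maps to the
# set of modes it conflicts with.
CONFLICTS = {
    "ACCESS SHARE":           {"ACCESS EXCLUSIVE"},
    "SHARE UPDATE EXCLUSIVE": {"SHARE UPDATE EXCLUSIVE", "SHARE",
                               "SHARE ROW EXCLUSIVE", "EXCLUSIVE",
                               "ACCESS EXCLUSIVE"},
    "ACCESS EXCLUSIVE":       {"ACCESS SHARE", "ROW SHARE",
                               "ROW EXCLUSIVE", "SHARE UPDATE EXCLUSIVE",
                               "SHARE", "SHARE ROW EXCLUSIVE",
                               "EXCLUSIVE", "ACCESS EXCLUSIVE"},
}

def conflict(a, b):
    """True if a holder of mode `a` blocks a request for mode `b`."""
    return b in CONFLICTS[a]

# A SELECT's implicit lock vs. an old-style exclusive-lock vacuum:
print(conflict("ACCESS SHARE", "ACCESS EXCLUSIVE"))        # True: blocks
# ...vs. the non-exclusive lock the 7.2 lazy vacuum takes:
print(conflict("ACCESS SHARE", "SHARE UPDATE EXCLUSIVE"))  # False: proceeds
```

So under the 7.2 scheme an idle-in-transaction reader no longer makes vacuum wait, and nobody stacks up behind it -- which is why Tom expects this sort of pile-up to become much less frequent.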
> As far as Postgres is concerned, there's no deadlock here, only a
> slow client

that could be true if we used explicit locks. all our accesses are of
the form "learn everything you need to know to do the transaction, then
open the database, do it, and close". there are some really long
SELECT's (which make dns zone files) but they can't block unless the
file system is blocking the write()'s in the client, which would only
happen in NFS, which we don't use.

your scenario is not implausible, however, and i will watch for it if
it happens again after i upgrade. i didn't mean to waste any of you
guys' time at this point, i just wanted to let you know about this in
case it was another data point in a problem you were tracking
elsewhere, or in case i'm able to track it more closely later.

thanks for your quick reply.