Обсуждение: locking issue on simple selects?
We have a production database server ... it's quite busy but usually working completely fine, simple queries taking a fraction of a millisecond to run. Recently we've frequently encountered issues where some simple selects (meaning, selects doing an index lookup and fetching one row) have become stuck for several minutes. Apparently all requests on one exact table gets stuck, all requests not related to said table are going through without any problems. According to the pg_stat_activity view, all queries getting stuck was read-queries (selects), no updates or anything like that (some of the transactions were doing updates and/or inserts though). The obvious thought seems to be that this is a locking issue ... but it doesn't seem so. For one thing, AFAIK locking shouldn't affect selects, only updates? I've also looked through tons of logs without finding any obvious locking issues. In one of the instances, I could find that there were some open transactions doing updates on one row in the table and then later becoming stuck (as all other transactions) when doing a select on another row in the said table. My second thought was that the database is on the edge of being overloaded and that the memory situation is also just on the edge ... important indexes that used to be in memory now has to be accessed from the disk. Still, it doesn't make sense, we're not seeing any serious impact on the CPU iowait status, and it seems improbable that it should take minutes to load an index? There aren't any long-lasting transactions going on when the jam occurs. I haven't checked much up, usually the jam seems to resolve itself pretty instantly, but I think that at some point it took half a minute from the first query was finished until the pg_stat_activity view got back to normal (meaning typically 0-5 simultaneously processed queries). FWIW, we're running pg 8.3.11, transaction isolation level serializable. The machine is quad-core hyperthreaded (16 CPUs visible to the OS), a SAN is used for storage, and different RAIDs are used for the root partition, pg data and pg wals. Any ideas? I'm aware that some configuration (i.e. checkpoint interval etc) may cause significant delay on write-queries ... but this is only read-queries.
Tobias Brox <tobixen@gmail.com> writes: > Recently we've frequently encountered issues where some simple selects > (meaning, selects doing an index lookup and fetching one row) have > become stuck for several minutes. Apparently all requests on one > exact table gets stuck, all requests not related to said table are > going through without any problems. According to the pg_stat_activity > view, all queries getting stuck was read-queries (selects), no updates > or anything like that (some of the transactions were doing updates > and/or inserts though). > The obvious thought seems to be that this is a locking issue ... but > it doesn't seem so. For one thing, AFAIK locking shouldn't affect > selects, only updates? An exclusive lock will block selects too. Have you looked into pg_locks for ungranted lock requests? regards, tom lane
On 15 September 2010 15:39, Tom Lane <tgl@sss.pgh.pa.us> wrote: > An exclusive lock will block selects too. Have you looked into pg_locks > for ungranted lock requests? Well - I thought so, we have a logging script that logs the content of the pg_locks table, it didn't log anything interesting but it may be a problem with the script itself. It does an inner join on pg_locks.relation = pg_class.oid but when I check now this join seems to remove most of the rows in the pg_locks table. Does it make sense at all to join pg_class with pg_locks? I will ask the sysadm to change to an outer join as for now.
Tobias Brox <tobixen@gmail.com> writes: > Well - I thought so, we have a logging script that logs the content of > the pg_locks table, it didn't log anything interesting but it may be a > problem with the script itself. It does an inner join on > pg_locks.relation = pg_class.oid but when I check now this join seems > to remove most of the rows in the pg_locks table. Does it make sense > at all to join pg_class with pg_locks? I will ask the sysadm to > change to an outer join as for now. There are lots of locks that are not tied to a specific table, so an inner join is definitely bad. You could use an outer join to annotate rows that do correspond to table locks, though. regards, tom lane
Tobias Brox wrote: > I thought so, we have a logging script that logs the content of > the pg_locks table, it didn't log anything interesting but it may be a > problem with the script itself. It does an inner join on > pg_locks.relation = pg_class.oid but when I check now this join seems > to remove most of the rows in the pg_locks table. Some of the waits you might be running into will be things waiting on another transaction holding a lock to finish, which are probably wiped out by this approach. There are some useful examples of lock views on the wiki: http://wiki.postgresql.org/wiki/Lock_Monitoring http://wiki.postgresql.org/wiki/Lock_dependency_information http://wiki.postgresql.org/wiki/Find_Locks And the idea you have of coverting the pg_class one to an outer join will help. The other thing you should try is toggling on log_lock_waits and possibly reducing deadlock_timeout. This will put a lot of the information you're trying to collect right into the logs. -- Greg Smith, 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us Author, "PostgreSQL 9.0 High Performance" Pre-ordering at: https://www.packtpub.com/postgresql-9-0-high-performance/book
On 10-09-15 03:07 PM, Tobias Brox wrote: > On 15 September 2010 15:39, Tom Lane<tgl@sss.pgh.pa.us> wrote: >> An exclusive lock will block selects too. Have you looked into pg_locks >> for ungranted lock requests? > Well - I thought so, we have a logging script that logs the content of > the pg_locks table, it didn't log anything interesting but it may be a > problem with the script itself. It does an inner join on > pg_locks.relation = pg_class.oid but when I check now this join seems > to remove most of the rows in the pg_locks table. Does it make sense > at all to join pg_class with pg_locks? I will ask the sysadm to > change to an outer join as for now. > You can also enable log_lock_waits and the lock waits will appear in your Postgres logs. -- Brad Nicholson 416-673-4106 Database Administrator, Afilias Canada Corp.
On 15 September 2010 21:28, Greg Smith <greg@2ndquadrant.com> wrote: > There are some useful examples of lock views on the wiki: > > http://wiki.postgresql.org/wiki/Lock_Monitoring > http://wiki.postgresql.org/wiki/Lock_dependency_information > http://wiki.postgresql.org/wiki/Find_Locks Thanks. I think those pages probably should be merged ... hmm ... if I manage to solve my locking issues I should probably try and contribute to the wiki. Reading the wiki pages, for me it boils down to three things: 1) the current query we're logging seems good enough except that we should do an outer join except for inner join towards pg_class, so I've asked our sysadm to fix it. 2) the middle query on http://wiki.postgresql.org/wiki/Lock_Monitoring seems very useful, and I've asked our sysadm to set up logging of this one as well. 3) That log_lock_waits config option that you and Brad points to seems very useful, so I've asked our sysadm to enable it. I also discovered that there is an attribute pg_stat_activity.waiting - I suppose it is 't' if a query is waiting for a lock? It seems quite useful ... > reducing deadlock_timeout. It's set to one second, and some of the jams we have been experiencing has lasted for several minutes. I also think it should say in the pg log if there is a deadlock situation? I grepped for "deadlock" in the logs without finding anything. Well, we'll improve the logging, and wait for the next "jam" to occur ... and I'll make sure to post an update if/when I figure out something.
Tobias Brox wrote:
Certainly the 3rd one could be merged with one of the other two, and maybe all merged into one. I haven't cleaned up that whole area in a whole, it's due for a round of it soon. You're welcome to take a shot at it. We can always revert any mistakes made.
deadlock_timeout is how long a process trying to acquire a lock waits before it a) looks for a deadlock, and b) prints a report that it's waiting into the logs when log_lock_waits is on. So if you're looking for slow lock acquisition that's in the sub-second range, it can be useful to reduce regardless of whether deadlock ever happens. That doesn't sound like your situation though.
I think those pages probably should be merged ... hmm ... if I manage to solve my locking issues I should probably try and contribute to the wiki.
Certainly the 3rd one could be merged with one of the other two, and maybe all merged into one. I haven't cleaned up that whole area in a whole, it's due for a round of it soon. You're welcome to take a shot at it. We can always revert any mistakes made.
reducing deadlock_timeout.It's set to one second, and some of the jams we have been experiencing has lasted for several minutes. I also think it should say in the pg log if there is a deadlock situation?
deadlock_timeout is how long a process trying to acquire a lock waits before it a) looks for a deadlock, and b) prints a report that it's waiting into the logs when log_lock_waits is on. So if you're looking for slow lock acquisition that's in the sub-second range, it can be useful to reduce regardless of whether deadlock ever happens. That doesn't sound like your situation though.
-- Greg Smith, 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us Author, "PostgreSQL 9.0 High Performance" Pre-ordering at: https://www.packtpub.com/postgresql-9-0-high-performance/book
On 15 September 2010 12:05, Tobias Brox <tobixen@gmail.com> wrote: > Recently we've frequently encountered issues where some simple selects > (meaning, selects doing an index lookup and fetching one row) have > become stuck for several minutes. Apparently all requests on one > exact table gets stuck, all requests not related to said table are > going through without any problems. Now I've set up all kind of logging regarding locks, so it seems like we're having issues that aren't lock-related. I just did a bit of research into one situation today. All while having this problem, there was one heavy query running in parallell ... not sure if that's relevant. Then comes one query that requires a seq scan on the problem table (that won't happen again - I just added a new index). Four seconds later comes another query requiring a simple index lookup. Still more queries comes in, most of them simple index lookups, but on different indexes. After one minute there are 25 queries in the pg_stat_activity view towards this table. It's not a particularly huge table. Moments later all 25 queries have been executed.
Tobias Brox <tobixen@gmail.com> wrote: > All while having this problem, there was one heavy query running > in parallell ... not sure if that's relevant. Have you turned on checkpoint logging? You might want to see if these are happening at some particular point in the checkpoint processing. If so, look through the archives for posts from Greg Smith on how to tune that -- he's worked out a nice methodology to iteratively improve your configuration in this regard. -Kevin
On 23 September 2010 22:55, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: > Have you turned on checkpoint logging? Yes ... it seems so: 13:19:13.840 - LOG: checkpoint complete: wrote 3849 buffers (0.2%); 0 transaction log file(s) added, 0 removed, 5 recycled; write=269.551 s, sync=0.103 s, total=269.953 s 13:19:13.841 - LOG: checkpoint starting: xlog 13:19:33 - the seq scan query towards the affected table started 13:20:31.454 - one of the index lookup queries towards the affected table was finished 13:20:43.176 - LOG: checkpoint complete: wrote 108199 buffers (6.9%); 0 transaction log file(s) added, 0 removed, 16 recycled; write=11.521 s, sync=77.533 s, total=89.335 s > You might want to see if > these are happening at some particular point in the checkpoint > processing. If so, look through the archives for posts from Greg > Smith on how to tune that -- he's worked out a nice methodology to > iteratively improve your configuration in this regard. Thank you, I will ... hmm ... I found this blog post: http://blog.2ndquadrant.com/en/2010/01/measuring-postgresql-checkpoin.html Of course I'm doing it my own way: select *,now() as snapshot into tmp_pg_stat_bgwriter from pg_stat_bgwriter ; create view tmp_delta_pg_stat_bgwriter as select a.checkpoints_timed-b.checkpoints_timed as checkpoints_timed,a.checkpoints_req-b.checkpoints_req as checkpoints_req,a.buffers_checkpoint-b.buffers_checkpoint as buffers_checkpoint,a.buffers_clean-b.buffers_clean as buffers_clean,a.maxwritten_clean-b.maxwritten_clean as maxwritten_clean,a.buffers_backend-b.buffers_backend as buffers_backend,a.buffers_alloc-b.buffers_alloc as buffers_alloc, now()-b.snapshot as interval from pg_stat_bgwriter a , (select * from tmp_pg_stat_bgwriter order by snapshot desc limit 1) as b; Checkpoint timeout is set to 5 minutes. Right now we're having relatively low activity. I'm not sure how to read the stats below, but they look OK to me: select * from tmp_delta_pg_stat_bgwriter ; checkpoints_timed | checkpoints_req | buffers_checkpoint | buffers_clean | maxwritten_clean | buffers_backend | buffers_alloc | interval -------------------+-----------------+--------------------+---------------+------------------+-----------------+---------------+----------------- 3 | 0 | 8277 | 15 | 0 | 185 | 18691 | 00:12:02.988842 (1 row)
Tobias Brox <tobixen@gmail.com> wrote: > 13:19:13.840 - LOG: checkpoint complete > 13:19:13.841 - LOG: checkpoint starting > 13:20:43.176 - LOG: checkpoint complete There wasn't a lot of time between the completion of one checkpoint and the start of the next. And the two checkpoints finished a minute and a half apart. Perhaps you need to boost your checkpoint_segments setting? What is it now? -Kevin