Обсуждение: Query performance inconsistant.
I have been having performance problems with my DB so this morning I added some config to log queries that take more than 250ms. The result is surprising because some queries will take as long as 10 seconds, but then you do a explain analyze on them they show that indexes are being used and they run very fast. Here is an example: 2006-08-31 05:55:39.560 LOG: duration: 3835.182 ms statement: select acctMessage( <params hidden to protect the innocent> ) But the same query returns this when I explain it: > > explain analyze select acctMessage( <params hidden to protect the innocent> ); QUERY PLAN -------------------------------------------------------------------------------------- Result (cost=0.00..0.03 rows=1 width=0) (actual time=26.797..26.799 rows=1 loops=1) Total runtime: 36.838 ms So the question is, why do some queries take a very long time? Sure, the obvious answer is the machine was busy doing something else, but I can't find any evidence of that: 05:40:01 AM CPU %user %nice %system %iowait %steal %idle 05:54:01 AM all 0.58 0.00 0.47 0.50 0.00 98.45 05:55:01 AM all 0.46 0.00 0.31 3.45 0.00 95.79 05:56:01 AM all 0.75 0.00 0.25 4.32 0.00 94.69 05:40:01 AM kbmemfree kbmemused %memused kbbuffers kbcached kbswpfree kbswpused %swpused kbswpcad 05:54:01 AM 96376 3019292 96.91 6724 2789348 1004000 20 0.00 20 05:55:01 AM 92904 3022764 97.02 7844 2791424 1004000 20 0.00 20 05:56:01 AM 98840 3016828 96.83 9056 2784160 1004000 20 0.00 20 05:40:01 AM tps rtps wtps bread/s bwrtn/s 05:54:01 AM 21.53 4.35 17.18 67.77 344.84 05:55:01 AM 71.61 59.11 12.50 1202.79 283.57 05:56:01 AM 29.22 13.94 15.29 264.18 316.59 Any thoughts on how to track this down? I don't want to go buy a faster server when I can't confirm that hardware performance is the problem. Thanks, schu
Matthew Schumacher <matt.s@aptalaska.net> writes: > I have been having performance problems with my DB so this morning I > added some config to log queries that take more than 250ms. The result > is surprising because some queries will take as long as 10 seconds, but > then you do a explain analyze on them they show that indexes are being > used and they run very fast. Is it possible that it's not directly that query's fault? For instance it could be blocked by a lock held by some other transaction. I can't unfortunately think of any very nice way to deduce this from log entries ... you'd have to catch it in the act and look into pg_locks to find out who's the perpetrator. regards, tom lane
Tom Lane wrote: > Matthew Schumacher <matt.s@aptalaska.net> writes: >> I have been having performance problems with my DB so this morning I >> added some config to log queries that take more than 250ms. The result >> is surprising because some queries will take as long as 10 seconds, but >> then you do a explain analyze on them they show that indexes are being >> used and they run very fast. > > Is it possible that it's not directly that query's fault? For instance > it could be blocked by a lock held by some other transaction. I can't > unfortunately think of any very nice way to deduce this from log entries > ... you'd have to catch it in the act and look into pg_locks to find out > who's the perpetrator. > > regards, tom lane This does help me try to figure out where the problem is. The proc in question inserts in a very large table, and updates another large table. Since postgres puts each proc in it's own transaction I'm thinking the problem may be the database locking these large tables while this proc is called concurrently. In order to understand this better I need to know how postgres locking works and when locks are used. Do you know of any documentation that I can read that explains this? Thanks, schu
On Thu, 2006-08-31 at 11:04 -0800, Matthew Schumacher wrote: > Tom Lane wrote: > > Matthew Schumacher <matt.s@aptalaska.net> writes: > >> I have been having performance problems with my DB so this morning I > >> added some config to log queries that take more than 250ms. The result > >> is surprising because some queries will take as long as 10 seconds, but > >> then you do a explain analyze on them they show that indexes are being > >> used and they run very fast. > > > > Is it possible that it's not directly that query's fault? For instance > > it could be blocked by a lock held by some other transaction. I can't > > unfortunately think of any very nice way to deduce this from log entries > > ... you'd have to catch it in the act and look into pg_locks to find out > > who's the perpetrator. > > > > regards, tom lane > > This does help me try to figure out where the problem is. The proc in > question inserts in a very large table, and updates another large table. > Since postgres puts each proc in it's own transaction I'm thinking the > problem may be the database locking these large tables while this proc > is called concurrently. > > In order to understand this better I need to know how postgres locking > works and when locks are used. Do you know of any documentation that I > can read that explains this? http://www.postgresql.org/docs/8.1/static/mvcc.html In the "Explicit Locking" section it details the locks acquired by UPDATE, etc. From what you described, I would not expect many locking problems. Are there any other types of queries you run that may cause a lock? Do you run periodic "VACUUM FULL" or something? "VACUUM FULL" causes a full table lock, and is usually not necessary. If so, try running just "VACUUM" without "FULL". Regards, Jeff Davis
Jeff Davis wrote: > http://www.postgresql.org/docs/8.1/static/mvcc.html > > In the "Explicit Locking" section it details the locks acquired by > UPDATE, etc. > >From what you described, I would not expect many locking problems. Are > there any other types of queries you run that may cause a lock? Do you > run periodic "VACUUM FULL" or something? "VACUUM FULL" causes a full > table lock, and is usually not necessary. If so, try running just > "VACUUM" without "FULL". > > Regards, > Jeff Davis > Jeff, I have "autovacuum = on" in the config file with a pretty frequent autovacuum_naptime, but I'm unsure if that does a vacuum or vacuum full. schu
Jeff Davis <pgsql@j-davis.com> writes: > From what you described, I would not expect many locking problems. Are > there any other types of queries you run that may cause a lock? Row locks (SELECT FOR UPDATE/SHARE) are a possible problem, particularly if this is a pre-8.1 Postgres where exclusive row locks were used for foreign key constraints. regards, tom lane
Matthew Schumacher <matt.s@aptalaska.net> writes: > I have "autovacuum = on" in the config file with a pretty frequent > autovacuum_naptime, but I'm unsure if that does a vacuum or vacuum full. autovacuum *never* does a vacuum full, because that would lead to unexpected blockages of foreground queries. Still though, autovac could be contributing to the problem indirectly. I'm assuming that most of your transactions on the problem table are short. It's possible that one or more clients are grabbing quasi-exclusive table locks, and normally you don't notice because they are able to get the lock quickly, do their work, and get out. But if autovac is working on the table then the requestor of the exclusive lock blocks ... and everyone else queues up behind him, until the vacuum command finishes with the table. regards, tom lane
Tom Lane wrote: > Jeff Davis <pgsql@j-davis.com> writes: >> From what you described, I would not expect many locking problems. Are >> there any other types of queries you run that may cause a lock? > > Row locks (SELECT FOR UPDATE/SHARE) are a possible problem, particularly > if this is a pre-8.1 Postgres where exclusive row locks were used for > foreign key constraints. > > regards, tom lane Tom, I'm still having issues with this so lemme provide more information.... perhaps there is something obvious.... Here is the proc that has very inconsistent (anywhere from 25ms to 8000ms) performance: CREATE FUNCTION acctmessage(_accttype character varying, _username character varying, _ipaddress character varying, _nastimestamp timestamp with time zone, _sessionid character varying, _nassessionid character varying, _nasipaddress character varying, _input octets bigint, _outputoctets bigint, _inputgigawords integer, _outputgigawords integer, _sessionlength bigint, _termcause charact er varying, _nasidentifier character varying, _clientipaddress character varying, _nasport character varying, _framedprotocol cha racter varying, _servicetype character varying, _connectinfo character varying) RETURNS void AS $$ DECLARE session_rec RECORD; BEGIN IF _clientipaddress <> '127.0.0.1' THEN INSERT into accounting_tab ( acctType, userName, ipAddress, nasTimestamp, sessionId, nasSessionId, nasIpAddress, inputOctets, outputOctets, inputGigaWords, outputGigaWords, sessionLength, termCause, nasIdentifier, clientIpAddress, nasPort, framedProtocol, serviceType, connectInfo ) values ( _acctType, _userName, _ipAddress, _nasTimestamp, _sessionId, _nasSessionId, _nasIpAddress, _inputOctets, _outputOctets, _inputGigaWords, _outputGigaWords, _sessionLength, _termCause, _nasIdentifier, _clientIpAddress, _nasPort, _framedProtocol, _serviceType, _connectInfo ); END IF; SELECT INTO session_rec sessionId FROM radutmp_tab WHERE sessionId = _sessionId; IF session_rec.sessionId IS NULL AND _acctType = 'start' THEN INSERT into radutmp_tab ( lastAcctType, userName, ipAddress, nasStartTimestamp, sessionId, nasSessionId, nasIpAddress, inputOctets, outputOctets, inputGigaWords, outputGigaWords, sessionLength, termCause, nasIdentifier, clientIpAddress, nasPort, framedProtocol, serviceType ) values ( _acctType, _userName, _ipAddress, _nasTimestamp, _sessionId, _nasSessionId, _nasIpAddress, _inputOctets, _outputOctets, _inputGigaWords, _outputGigaWords, _sessionLength, _termCause, _nasIdentifier, _clientIpAddress, _nasPort, _framedProtocol, _serviceType ) ; ELSIF session_rec.sessionId IS NOT NULL AND _acctType = 'stop' THEN UPDATE radutmp_tab SET lastAcctType = _acctType, nasStopTimestamp = _nasTimestamp, ipAddress = _ipAddress, sessionlength = _sessionlength, inputOctets = _inputOctets, outputOctets = _outputOctets, inputgigawords = _inputgigawords, outputgigawords = _outputgigawords, nasSessionId = _nasSessionId, nasIPAddress = _nasIPAddress, clientIPAddress = _clientIPAddress, nasPort = _nasPort, framedProtocol = _framedProtocol, termCause = _termCause WHERE sessionId = _sessionId AND userName = _userName AND serviceType = _serviceType; END IF; END; $$ LANGUAGE plpgsql; It looks long, but it's really pretty simple, it inserts data into the accounting_tab and then updates or inserts into the radutmp_tab table based on whether the session ID is known or not. Here are the tables: Table "public.accounting_tab" Column | Type | Modifiers -----------------+--------------------------+--------------- sessionid | character varying(32) | not null nassessionid | character varying(32) | not null accttype | character varying(6) | not null username | character varying(20) | not null nastimestamp | timestamp with time zone | nasipaddress | character varying(15) | not null nasidentifier | character varying(15) | clientipaddress | character varying(15) | not null servicetype | character varying(6) | not null sessionlength | bigint | default 0 inputoctets | bigint | default 0 outputoctets | bigint | default 0 inputgigawords | integer | default 0 outputgigawords | integer | default 0 nasport | character varying(32) | ipaddress | character varying(32) | framedprotocol | character varying(32) | termcause | character varying(32) | timestamp | timestamp with time zone | default now() connectinfo | character varying(100) | Indexes: "accounting_nasidentifier_idx" btree (nasidentifier) "accounting_nastimestamp_idx" btree (nastimestamp) "accounting_sessionid_idx" btree (sessionid) "accounting_timestamp_idx" btree ("timestamp") "accounting_username_idx" btree (username) Table "public.radutmp_tab" Column | Type | Modifiers -------------------+--------------------------+--------------- sessionid | character varying(32) | not null nassessionid | character varying(32) | not null lastaccttype | character varying(6) | not null username | character varying(20) | not null nasstarttimestamp | timestamp with time zone | nasstoptimestamp | timestamp with time zone | nasipaddress | character varying(15) | not null nasidentifier | character varying(15) | clientipaddress | character varying(15) | not null servicetype | character varying(6) | not null sessionlength | bigint | default 0 inputoctets | bigint | default 0 outputoctets | bigint | default 0 inputgigawords | integer | default 0 outputgigawords | integer | default 0 nasport | character varying(32) | ipaddress | character varying(32) | framedprotocol | character varying(32) | termcause | character varying(32) | timestamp | timestamp with time zone | default now() Indexes: "radutmp_tab_pkey" PRIMARY KEY, btree (sessionid) "radutmp_nasstoptimestamp_idx" btree (nasstoptimestamp) "radutmp_servicetype_idx" btree (servicetype) "radutmp_username_idx" btree (username) As you can see the tables are pretty simple and don't have any foreign keys, but they are large. accounting_tab is 1.8mill rows and radutmp_tab is 200k rows. I should note that This query gets run quite a bit as well: SELECT COUNT(*) FROM radutmp_tab WHERE UserName='username' AND nasStopTimestamp IS NULL and serviceType = 'servicetype' It tells me if there is an active session or not. This setup, with concurrency, is returning very inconsistent query performance. Sometimes its very fast, other times it's slow and waits. This makes me think I have a table locking issue, but I'm not sure since pg_locks rarely reports more than this: locktype | database | relation | page | tuple | transactionid | classid | objid | objsubid | transaction | pid | mode | granted ---------------+----------+----------+------+-------+---------------+---------+-------+----------+-------------+------+-----------------+--------- relation | 16395 | 10342 | | | | | | | 30641038 | 8720 | AccessShareLock | t transactionid | | | | | 30641038 | | | | 30641038 | 8720 | ExclusiveLock | t Do you see anything wrong with how I'm doing this? Perhaps it's time to get faster hardware, but it doesn't seem like the box is that loaded. Thanks, schu
Matthew Schumacher <matt.s@aptalaska.net> writes: > Here is the proc that has very inconsistent (anywhere from 25ms to > 8000ms) performance: > ... > This setup, with concurrency, is returning very inconsistent query > performance. Sometimes its very fast, other times it's slow and waits. > This makes me think I have a table locking issue, but I'm not sure > since pg_locks rarely reports more than this: Yeah, I suspect a locking issue too, but you won't find out what it is unless you can capture the content of pg_locks at a time when the proc is being delayed. You could try making the proc do LOCK TABLE ... IN ROW EXCLUSIVE MODE NOWAIT inside an exception block, catching the error, and logging the contents of pg_locks to someplace. regards, tom lane