Обсуждение: Random slow queries
Hi, I have a weird slow query issue I can't seem to find the cause of, so I'm hoping someone here can shed some light on this. Context: I have an application which (among other things) implements a sort of job queue using postgres for persistence. While I know a RDBMS is not necessarily the most ideal tool for this, there are some constraints and requirements which make including a messaging/queueing middleware less than trivial. To reduce locking issues we make use of advisory locks and CTE queries to select jobs from the queue (inspired by https://github.com/chanks/que/blob/master/lib/que/sql.rb) as well as some explicit or optimistic locking during job processing. As the volume of jobs to process isn't extremely high, performance seems very good - except for the issue I'm describing below. The problem: Most queries execute fast, however sometimes queries on the job table (which contains the job queue) take exactly 122 seconds (+ 0-50ms) to execute for no clear reason. I have set log_min_duration_statement to 200ms, and the only queries that are logged are the 2-minute queries. I have also enabled log_lock_waits, but no lock waits are logged. Whereas extremely high performance is not to be expected, a query taking two minutes is way too much - especially seeing as other concurrent queries seem to be completing perfectly on time. As a reference, in a time span of less than 4 minutes the application processes a test load of 2000+ jobs going through 3 stages in the job processing 'pipeline' - each stage using the CTE query to pick up a job. So that's 6000+ executions of the CTE + several 1000 executions of other queries, of which only a few (usually 1-5 per test run) inexplicably take two minutes to complete. The offending queries seem to be quite random: SELECT ... FOR UPDATE queries, CTE queries, and sometimes even regular SELECTs querying our job table take inexplicably long. PostgreSQL version: PostgreSQL 9.3.4, compiled by Visual C++ build 1600, 64-bit How PostgreSQL was installed: through GUI installer from EnterpriseDB Changes made to the settings in the postgresql.conf file: name | current_setting | source -----------------------------+----------------------+---------------------- application_name | psql | client client_encoding | WIN1252 | client DateStyle | ISO, DMY | configuration file default_text_search_config | pg_catalog.english | configuration file lc_messages | English_Ireland.1252 | configuration file lc_monetary | English_Ireland.1252 | configuration file lc_numeric | English_Ireland.1252 | configuration file lc_time | English_Ireland.1252 | configuration file listen_addresses | * | configuration file log_autovacuum_min_duration | 100ms | configuration file log_destination | stderr | configuration file log_line_prefix | %m | configuration file log_lock_waits | on | configuration file log_min_duration_statement | 200ms | configuration file log_timezone | Europe/Brussels | configuration file logging_collector | on | configuration file max_connections | 100 | configuration file max_stack_depth | 2MB | environment variable port | 5432 | configuration file shared_buffers | 128MB | configuration file tcp_keepalives_count | 0 | configuration file tcp_keepalives_idle | 30 | configuration file tcp_keepalives_interval | 30 | configuration file TimeZone | Europe/Brussels | configuration file Operating system and version: Windows 8.1 64bit (version 6.3 build 9600) Program used to connect to PostgreSQL: custom Java application using Hibernate, HikariCP and the pgjdbc-ng 0.6 driver. The application runs on the same machine as the PostgreSQL database (at least for now, during development) Relevant or unusual information in the PostgreSQL server logs?: aside from the slow query log, there is nothing special in the logs I think. For example, this is the entirety of what was logged during one test run - I just shortened the select query a bit (replacing a list of fields with *) for readability as hibernate doesn't produce the most clean queries: 2016-06-27 22:53:06.764 CEST LOG: database system was shut down at 2016-06-27 22:53:04 CEST 2016-06-27 22:53:06.892 CEST LOG: database system is ready to accept connections 2016-06-27 22:53:07.039 CEST LOG: autovacuum launcher started 2016-06-27 22:55:07.764 CEST LOG: automatic analyze of table "app.public.job" system usage: CPU 0.00s/0.06u sec elapsed 0.20 sec 2016-06-27 22:57:09.570 CEST LOG: duration: 122006.000 ms bind cached-1453392550: select job.* from job inner join project on job.project_id = project.id inner join category on project.category_id = category.id inner join type on job.type_id = type.id where job.id=$1 2016-06-27 22:57:09.570 CEST DETAIL: parameters: $1 = '34309' 2016-06-27 22:57:09.572 CEST LOG: duration: 122011.000 ms bind cached--195714239: WITH RECURSIVE jobs AS (SELECT (i).*, pg_try_advisory_lock((i).id) AS locked FROM ( SELECT i FROM job_priority AS i WHERE status = $1 ORDER BY priority DESC, received ASC LIMIT 1) AS i1 UNION ALL (SELECT (i).*, pg_try_advisory_lock((i).id) AS locked FROM ( SELECT ( SELECT i FROM job_priority AS i WHERE status = $2 AND (-priority, received) > (-jobs.priority, jobs.received) ORDER BY priority DESC, received ASC LIMIT 1) AS i FROM jobs WHERE jobs.id IS NOT NULL LIMIT 1) AS i1 ) ) SELECT id, path, status, received, attempts, last_modified, version FROM jobs WHERE locked LIMIT 1 2016-06-27 22:57:09.572 CEST DETAIL: parameters: $1 = 'RECEIVED', $2 = 'RECEIVED' 2016-06-27 22:58:07.332 CEST LOG: automatic analyze of table "app.public.job " system usage: CPU 0.00s/0.06u sec elapsed 0.11 sec 2016-06-27 22:59:07.693 CEST LOG: automatic vacuum of table "app.public.job ": index scans: 1 pages: 0 removed, 140 remain tuples: 136 removed, 2391 remain buffer usage: 546 hits, 0 misses, 209 dirtied avg read rate: 0.000 MB/s, avg write rate: 3.388 MB/s system usage: CPU 0.00s/0.00u sec elapsed 0.48 sec Full table and index schema: CREATE TABLE job ( id INTEGER PRIMARY KEY, a VARCHAR(64) NOT NULL, path VARCHAR(512) NOT NULL UNIQUE, status VARCHAR(20) NOT NULL, project_id INTEGER NOT NULL REFERENCES project(id), type_id INTEGER NOT NULL REFERENCES type(id), b INTEGER NOT NULL, c INTEGER NOT NULL, d VARCHAR(32) NOT NULL, received TIMESTAMP NOT NULL, done TIMESTAMP, e VARCHAR(512), error VARCHAR(512), error_time TIMESTAMP, attempts INTEGER NOT NULL, last_modified TIMESTAMP, version INTEGER ); \d+ output: Column | Type | Modifiers | Storage | Stats ------------------+-----------------------------+-----------+----------+------ target | Description --------+------------- id | integer | not null | plain | a | character varying(64) | not null | extended | path | character varying(512) | not null | extended | status | character varying(20) | not null | extended | project_id | integer | not null | plain | type_id | integer | not null | plain | b | integer | not null | plain | c | integer | not null | plain | d | character varying(32) | not null | extended | received | timestamp without time zone | not null | plain | done | timestamp without time zone | | plain | e | character varying(512) | | extended | error | character varying(512) | | extended | error_time | timestamp without time zone | | plain | attempts | integer | not null | plain | last_modified | timestamp without time zone | | plain | version | integer | | plain | Indexes: "job_pkey" PRIMARY KEY, btree (id) "job_path_key" UNIQUE CONSTRAINT, btree (path) Foreign-key constraints: "job_project_id_fkey" FOREIGN KEY (project _id) REFERENCES project (id) "job_type_id_fkey" FOREIGN KEY (type_id) REFERENCES type(id) Has OIDs: no Additional information on the table: - a few columns contain mostly nulls (i.e. a column indicating any error messages which may occur during processing of the job) - Due to the nature of the application the table does get a high number of inserts/updates in a short time. As mentioned above, in the test in which I observe this issue the job table grows from 0 to 2000-2500 rows very quickly, with each row receiving on average 4 more updates. - Note: the CTE selects from a job_priority view which is a JOIN of 3 tables: job - project - category, with categories having a priority which impacts job processing order. The test case I am currently using comprises only 5 projects and a single category. Explain: unfortunately it's not always the same query that causes issues. If you want I can still run the most common queries through EXPLAIN ANALYZE but haven't included this output here as this email is already long enough... History: this is during development of an application, so there is no production history Hardware: i7 CPU, 15 GB of RAM, SSD - this is a development machine, so there's lots of other stuff running too (but no other databases in active use) Things I tried: - Upgrading to PostgreSQL 9.5.3, compiled by Visual C++ build 1800, 64-bit This did not solve the problem, queries still take 122 seconds from time to time - Enable auto_explain to obtain more information. Unfortunately, for the 2-minute queries no plan is logged. If I manually force a query to take a long time (eg SELECT pg_sleep(5)) or if I set auto_explain.log_min_duration low enough plans are logged for slow queries, _except_ for these 2-minute mysteries. This makes it very hard to see exactly why the query is taking so long. Is this a bug in auto_explain? Are cached plans never logged (this is not indicated in the documentation)? Some other reason why no plan would be logged? Note that this happens both in PostgreSQL 9.3.4 and 9.5.3 - Running some queries through EXPLAIN ANALYZE; however, these execute quickly (a few ms) and don't seem to indicate a clear problem to me When I run my test there is very high I/O on the system for a few minutes. However, I am reluctant to point to that as the sole cause for the following reasons: - other concurrent queries execute fast (it's a multi threaded application and other threads continue normally) - if I look at the resource monitor, some queries remain hung up even after disk I/O has quieted down - if I/O performance were the cause I would expect to see a lot more variance, now the queries always take exactly 122s (with only about 50ms variance) Given the reasonably small dataset (a pg_dump of the full database containing about 2500 jobs is less than 1MB) I would think that the whole database fits in memory anyway, making this issue all the more puzzling. Have I missed something obvious? Best regards, Roel
On Tue, Jun 28, 2016 at 8:24 PM, <devel.brain99@xoxy.net> wrote: > The problem: > Most queries execute fast, however sometimes queries on the job > table (which contains the job queue) take exactly 122 seconds > (+ 0-50ms) to execute for no clear reason. > Have I missed something obvious? Please monitor for the start of such an event and capture the full contents of pg_stat_activity and pg_locks during that 2 minute window. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
-----Original Message----- From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of devel.brain99@xoxy.net Sent: Tuesday, June 28, 2016 9:24 PM To: pgsql-performance@postgresql.org Subject: [PERFORM] Random slow queries Hi, I have a weird slow query issue I can't seem to find the cause of, so I'm hoping someone here can shed some light on this. ........................................ ........................................ Given the reasonably small dataset (a pg_dump of the full database containing about 2500 jobs is less than 1MB) I would thinkthat the whole database fits in memory anyway, making this issue all the more puzzling. Have I missed something obvious? Best regards, Roel ______________________________________________________________________________________________________________________ Did you try AUTO_EXPLAIN extension (https://www.postgresql.org/docs/9.3/static/auto-explain.html) for diagnostic purposes? With auto_explain.loganalize = true it will log automatically EXPLAIN ANALYZE output, rather than just EXPLAIN output. Turningthis parameter ON permanently could have negative impact on over-all performance, so use it judiciously. Regards, Igor
On 29 June 2016 at 14:45, Kevin Grittner <kgrittn@gmail.com> wrote: > Please monitor for the start of such an event and capture the full > contents of pg_stat_activity and pg_locks during that 2 minute > window. I had already looked at that manually and found nothing unusual. To be more thorough, I now had a batch file log the contents of pg_stat_activity and pg_locks every 5 seconds. During my test run, there was one offending query invocation, a simple SELECT * FROM job WHERE field = $1 Of course the actual query specified the list of fields as it was generated by Hibernate, but that is what it boils down to - no joins etc. The column on which was queried is a VARCHAR(64) NOT NULL, not unique nor indexed (though in practice most values are unique). I can of course post the full output somewhere if you want (though it's more than 1000 lines). In the meantime, here is what I can gather from the output for these 2 minutes: 1) Looking at the logged pg_stat_activity data, there usually is only one other query executing: the select * from pg_stat_activity itself. Sometimes there's another query or a connection idle in transaction (which disappears again in the next output from pg_stat_activity), but given the volume of queries that's executed this seems expected. 2) Looking at pg_locks, the only locks that are consistently held throughout those 2 minutes are these 5: - the locks held by the slow query itself: an AccessShareLock on the job table, and a virtualxid ExclusiveLock (the query does not happen within a transaction). - the advisory lock for the job this thread is processing - locks held by the SELECT * FROM pg_locks query (a lock on the pg_locks table and a virtualxid lock) These 5 locks are of course all granted. Other locks change every 5 seconds, and often no other locks are held at all. Best regards, Roel
On 29 June 2016 at 16:32, Igor Neyman <ineyman@perceptron.com> wrote: > Did you try AUTO_EXPLAIN extension > (https://www.postgresql.org/docs/9.3/static/auto-explain.html) for > diagnostic purposes? > With auto_explain.loganalize = true it will log automatically EXPLAIN > ANALYZE output, rather than just EXPLAIN output. Turning this parameter ON > permanently could have negative impact on over-all performance, so use it > judiciously. > > Regards, > Igor > Yes, I tried that. As mentioned in my original email: > Things I tried: > ... > - Enable auto_explain to obtain more information. Unfortunately, for the > 2-minute queries no plan is logged. If I manually force a query to take a > long time (eg SELECT pg_sleep(5)) or if I set auto_explain.log_min_duration > low enough plans are logged for slow queries, _except_ for these 2-minute > mysteries. This makes it very hard to see exactly why the query is taking so > long. Is this a bug in auto_explain? Are cached plans never logged (this is > not indicated in the documentation)? Some other reason why no plan would be > logged? Note that this happens both in PostgreSQL 9.3.4 and 9.5.3 To be more precise, this is the config I used for auto_explain: shared_preload_libraries = 'auto_explain' auto_explain.log_min_duration = 2000 auto_explain.log_analyze = true auto_explain.log_buffers = true auto_explain.log_timing = true On a test query (eg using pg_sleep) there is logging for both the slow query log and for auto_explain; for my actual problem queries there's only the slow query log output, no auto_explain query plan. As you can see from the logs I posted, it appears the execution plan was cached (LOG: duration: 122006.000 ms bind cached-1453392550: select....). Maybe those aren't processed by auto_explain? Best regards, Roel
devel.brain99@xoxy.net writes: > As you can see from the logs I posted, it appears the execution plan was > cached (LOG: duration: 122006.000 ms bind cached-1453392550: select....). > Maybe those aren't processed by auto_explain? In that, "cached-1453392550" is a statement name given by the client; you'd know better than we do where it's coming from, but it has no particular significance to the server. The real information here is that what is taking 122 seconds is the BIND step of extended query protocol. That explains why auto_explain doesn't notice it; auto_explain only instruments the execution phase. Typically, what takes time in the BIND step is planning the query, so it seems like we have to conclude that something in planning is getting hung up. That doesn't get us very much closer to an explanation though :-(. Don't know if it would be practical for you at all, but if you could attach to a process that's stuck like this with a debugger and get a stack trace, that would probably be very informative. https://wiki.postgresql.org/wiki/Generating_a_stack_trace_of_a_PostgreSQL_backend regards, tom lane
On Tue, Jun 28, 2016 at 6:24 PM, <devel.brain99@xoxy.net> wrote: > > > PostgreSQL version: > PostgreSQL 9.3.4, compiled by Visual C++ build 1600, 64-bit The current minor version of that branch is 9.3.13, so you are 9 bug fix releases behind. I don't know if this matters, because I see that my first guess of your problem was fixed in commit 4162a55c77cbb54acb4ac442e, which was already included in 9.3.4. (Yes, you did say you also observed the problem in 9.5.3, but still, why intentionally run something that far behind?) > Things I tried: > - Upgrading to PostgreSQL 9.5.3, compiled by Visual C++ build 1800, 64-bit > This did not solve the problem, queries still take 122 seconds from time to > time Could you try 9.6beta2? In particular, I am wondering if your problem was solved by commit 8a7d0701814a4e293efad22091d6f6fb441bbe1c Author: Tom Lane <tgl@sss.pgh.pa.us> Date: Wed Aug 26 18:18:57 2015 -0400 Speed up HeapTupleSatisfiesMVCC() by replacing the XID-in-progress test. I am not entirely sure why this (as opposed to the previous-mentioned 4162a55c77cbb54) would fix a problem occurring during BIND, though. Cheers, Jeff
Jeff Janes <jeff.janes@gmail.com> writes: > On Tue, Jun 28, 2016 at 6:24 PM, <devel.brain99@xoxy.net> wrote: >> PostgreSQL 9.3.4, compiled by Visual C++ build 1600, 64-bit > The current minor version of that branch is 9.3.13, so you are 9 bug > fix releases behind. Definitely a fair complaint. > I don't know if this matters, because I see that my first guess of > your problem was fixed in commit 4162a55c77cbb54acb4ac442e, which was > already included in 9.3.4. That commit could have helped if the problem were simply slow planning. But I do not see how it explains a *consistent* 122-second delay. That sounds very much like a timeout expiring someplace, and I have no idea where. regards, tom lane
On 29 June 2016 at 22:20, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Don't know if it would be practical for you at all, but if you could > attach to a process that's stuck like this with a debugger and get a stack > trace, that would probably be very informative. It seems I have found the cause of my issues: my antivirus software. When I tried to debug a stuck process, the currently executing code was in some DLL without debugging information etc. When checking where the DLL came from, it appeared to be from the security software I had installed. I had been meaning to change because I have had some performance issues in the past (nowhere near as bad as this issue though!) but hadn't yet gotten around to it. After switching this weekend, the issue went away completely. Even though I had previously noticed some performance issues with my antivirus, I must say this still is a very weird failure mode - especially as it still occurred if I disabled all realtime protection (one of the first things I tried, even if not mentioned in my earlier emails). Thanks for the time spent trying to help! Best regards, Roel
On 6/29/16 1:01 PM, devel.brain99@xoxy.net wrote: > During my test run, there was one offending query invocation, a simple > SELECT * FROM job WHERE field = $1 > Of course the actual query specified the list of fields as it was generated > by Hibernate, but that is what it boils down to - no joins etc. The column on > which was queried is a VARCHAR(64) NOT NULL, not unique nor indexed (though > in practice most values are unique). Be careful about your assumptions there... SELECT * can have redically different performance than selecting individual fields. In particular, if you select something that's been toasted external, that's going to produce it's own index scan of the toast table, which could then run into conflicts with vacuuming. -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com 855-TREBLE2 (855-873-2532) mobile: 512-569-9461