Обсуждение: Random slow queries

Поиск
Список
Период
Сортировка

Random slow queries

От
devel.brain99@xoxy.net
Дата:
Hi,

I have a weird slow query issue I can't seem to find the cause of, so I'm
hoping someone here can shed some light on this.

Context:
I have an application which (among other things) implements a sort of job
queue using postgres for persistence. While I know a RDBMS is not necessarily
the most ideal tool for this, there are some constraints and requirements
which make including a messaging/queueing middleware less than trivial.
To reduce locking issues we make use of advisory locks and CTE queries to
select jobs from the queue (inspired by
https://github.com/chanks/que/blob/master/lib/que/sql.rb) as well as some
explicit or optimistic locking during job processing. As the volume of jobs
to process isn't extremely high, performance seems very good - except for the
issue I'm describing below.

The problem:
Most queries execute fast, however sometimes queries on the job table (which
contains the job queue) take exactly 122 seconds (+ 0-50ms) to execute for no
clear reason. I have set log_min_duration_statement to 200ms, and the only
queries that are logged are the 2-minute queries. I have also enabled
log_lock_waits, but no lock waits are logged. Whereas extremely high
performance is not to be expected, a query taking two minutes is way too
much - especially seeing as other concurrent queries seem to be completing
perfectly on time.
As a reference, in a time span of less than 4 minutes the application
processes a test load of 2000+ jobs going through 3 stages in the job
processing 'pipeline' - each stage using the CTE query to pick up a job. So
that's 6000+ executions of the CTE + several 1000 executions of other queries,
of which only a few (usually 1-5 per test run) inexplicably take two minutes
to complete. The offending queries seem to be quite random: SELECT ... FOR
UPDATE queries, CTE queries, and sometimes even regular SELECTs querying our
job table take inexplicably long.


PostgreSQL version:
PostgreSQL 9.3.4, compiled by Visual C++ build 1600, 64-bit

How PostgreSQL was installed: through GUI installer from EnterpriseDB

Changes made to the settings in the postgresql.conf file:
            name             |   current_setting    |        source
-----------------------------+----------------------+----------------------
 application_name            | psql                 | client
 client_encoding             | WIN1252              | client
 DateStyle                   | ISO, DMY             | configuration file
 default_text_search_config  | pg_catalog.english   | configuration file
 lc_messages                 | English_Ireland.1252 | configuration file
 lc_monetary                 | English_Ireland.1252 | configuration file
 lc_numeric                  | English_Ireland.1252 | configuration file
 lc_time                     | English_Ireland.1252 | configuration file
 listen_addresses            | *                    | configuration file
 log_autovacuum_min_duration | 100ms                | configuration file
 log_destination             | stderr               | configuration file
 log_line_prefix             | %m                   | configuration file
 log_lock_waits              | on                   | configuration file
 log_min_duration_statement  | 200ms                | configuration file
 log_timezone                | Europe/Brussels      | configuration file
 logging_collector           | on                   | configuration file
 max_connections             | 100                  | configuration file
 max_stack_depth             | 2MB                  | environment variable
 port                        | 5432                 | configuration file
 shared_buffers              | 128MB                | configuration file
 tcp_keepalives_count        | 0                    | configuration file
 tcp_keepalives_idle         | 30                   | configuration file
 tcp_keepalives_interval     | 30                   | configuration file
 TimeZone                    | Europe/Brussels      | configuration file

Operating system and version: Windows 8.1 64bit (version 6.3 build 9600)

Program used to connect to PostgreSQL: custom Java application using
Hibernate, HikariCP and the pgjdbc-ng 0.6 driver. The application runs on the
same machine as the PostgreSQL database (at least for now, during development)

Relevant or unusual information in the PostgreSQL server logs?: aside from the
slow query log, there is nothing special in the logs I think.
For example, this is the entirety of what was logged during one test run - I
just shortened the select query a bit (replacing a list of fields with *) for
readability as hibernate doesn't produce the most clean queries:
2016-06-27 22:53:06.764 CEST LOG:  database system was shut down at
2016-06-27 22:53:04 CEST
2016-06-27 22:53:06.892 CEST LOG:  database system is ready to accept
connections
2016-06-27 22:53:07.039 CEST LOG:  autovacuum launcher started
2016-06-27 22:55:07.764 CEST LOG:  automatic analyze of table
"app.public.job" system usage: CPU 0.00s/0.06u sec elapsed 0.20 sec
2016-06-27 22:57:09.570 CEST LOG:  duration: 122006.000 ms  bind
cached-1453392550: select job.* from job inner join project on
job.project_id = project.id inner join category on project.category_id
= category.id inner join type on job.type_id = type.id where job.id=$1
2016-06-27 22:57:09.570 CEST DETAIL:  parameters: $1 = '34309'
2016-06-27 22:57:09.572 CEST LOG:  duration: 122011.000 ms  bind
cached--195714239: WITH RECURSIVE jobs AS (SELECT (i).*,
pg_try_advisory_lock((i).id) AS locked FROM ( SELECT i FROM
job_priority AS i WHERE status = $1 ORDER BY priority DESC, received
ASC LIMIT 1) AS i1 UNION ALL (SELECT (i).*,
pg_try_advisory_lock((i).id) AS locked FROM ( SELECT ( SELECT i FROM
job_priority AS i WHERE status = $2 AND (-priority, received) >
(-jobs.priority, jobs.received) ORDER BY priority DESC, received ASC
LIMIT 1) AS i FROM jobs WHERE jobs.id IS NOT NULL LIMIT 1) AS i1 ) )
SELECT id, path, status, received, attempts, last_modified, version
FROM jobs WHERE locked LIMIT 1
2016-06-27 22:57:09.572 CEST DETAIL:  parameters: $1 = 'RECEIVED', $2
= 'RECEIVED'
2016-06-27 22:58:07.332 CEST LOG:  automatic analyze of table
"app.public.job " system usage: CPU 0.00s/0.06u sec elapsed 0.11 sec
2016-06-27 22:59:07.693 CEST LOG:  automatic vacuum of table
"app.public.job ": index scans: 1
    pages: 0 removed, 140 remain
    tuples: 136 removed, 2391 remain
    buffer usage: 546 hits, 0 misses, 209 dirtied
    avg read rate: 0.000 MB/s, avg write rate: 3.388 MB/s
    system usage: CPU 0.00s/0.00u sec elapsed 0.48 sec


Full table and index schema:
CREATE TABLE job
(
  id INTEGER PRIMARY KEY,
  a VARCHAR(64) NOT NULL,
  path VARCHAR(512) NOT NULL UNIQUE,
  status VARCHAR(20) NOT NULL,
  project_id INTEGER NOT NULL REFERENCES project(id),
  type_id INTEGER NOT NULL REFERENCES type(id),
  b INTEGER NOT NULL,
  c INTEGER NOT NULL,
  d VARCHAR(32) NOT NULL,
  received TIMESTAMP NOT NULL,
  done TIMESTAMP,
  e VARCHAR(512),
  error VARCHAR(512),
  error_time TIMESTAMP,
  attempts INTEGER NOT NULL,
  last_modified TIMESTAMP,
  version INTEGER
);
\d+ output:
      Column      |            Type             | Modifiers | Storage  | Stats
------------------+-----------------------------+-----------+----------+------
 target | Description
--------+-------------
 id               | integer                     | not null  | plain    |
 a                | character varying(64)       | not null  | extended |
 path             | character varying(512)      | not null  | extended |
 status           | character varying(20)       | not null  | extended |
 project_id       | integer                     | not null  | plain    |
 type_id          | integer                     | not null  | plain    |
 b                | integer                     | not null  | plain    |
 c                | integer                     | not null  | plain    |
 d                | character varying(32)       | not null  | extended |
 received         | timestamp without time zone | not null  | plain    |
 done             | timestamp without time zone |           | plain    |
 e                | character varying(512)      |           | extended |
 error            | character varying(512)      |           | extended |
 error_time       | timestamp without time zone |           | plain    |
 attempts         | integer                     | not null  | plain    |
 last_modified    | timestamp without time zone |           | plain    |
 version          | integer                     |           | plain    |
Indexes:
    "job_pkey" PRIMARY KEY, btree (id)
    "job_path_key" UNIQUE CONSTRAINT, btree (path)
Foreign-key constraints:
    "job_project_id_fkey" FOREIGN KEY (project _id) REFERENCES project (id)
    "job_type_id_fkey" FOREIGN KEY (type_id) REFERENCES type(id)
Has OIDs: no

Additional information on the table:
- a few columns contain mostly nulls (i.e. a column indicating any error
messages which may occur during processing of the job)
- Due to the nature of the application the table does get a high number of
inserts/updates in a short time. As mentioned above, in the test in which I
observe this issue the job table grows from 0 to 2000-2500 rows very quickly,
with each row receiving on average 4 more updates.
- Note: the CTE selects from a job_priority view which is a JOIN of 3 tables:
job - project - category, with categories having a priority which impacts job
processing order. The test case I am currently using comprises only 5 projects
and a single category.

Explain: unfortunately it's not always the same query that causes issues. If
you want I can still run the most common queries through EXPLAIN ANALYZE but
haven't included this output here as this email is already long enough...

History: this is during development of an application, so there is no
production history

Hardware: i7 CPU, 15 GB of RAM, SSD - this is a development machine, so
there's lots of other stuff running too (but no other databases in active use)


Things I tried:
- Upgrading to PostgreSQL 9.5.3, compiled by Visual C++ build 1800, 64-bit
This did not solve the problem, queries still take 122 seconds from time to
time
- Enable auto_explain to obtain more information. Unfortunately, for the
2-minute queries no plan is logged. If I manually force a query to take a long
time (eg SELECT pg_sleep(5)) or if I set auto_explain.log_min_duration low
enough plans are logged for slow queries, _except_ for these 2-minute
mysteries. This makes it very hard to see exactly why the query is taking so
long. Is this a bug in auto_explain? Are cached plans never logged (this is
not indicated in the documentation)? Some other reason why no plan would be
logged? Note that this happens both in PostgreSQL 9.3.4 and 9.5.3
- Running some queries through EXPLAIN ANALYZE; however, these execute quickly
(a few ms) and don't seem to indicate a clear problem to me


When I run my test there is very high I/O on the system for a few minutes.
However, I am reluctant to point to that as the sole cause for the following
reasons:
- other concurrent queries execute fast (it's a multi threaded application and
other threads continue normally)
- if I look at the resource monitor, some queries remain hung up even after
disk I/O has quieted down
- if I/O performance were the cause I would expect to see a lot more variance,
now the queries always take exactly 122s (with only about 50ms variance)

Given the reasonably small dataset (a pg_dump of the full database containing
about 2500 jobs is less than 1MB) I would think that the whole database fits
in memory anyway, making this issue all the more puzzling. Have I missed
something obvious?

Best regards,
Roel



Re: Random slow queries

От
Kevin Grittner
Дата:
On Tue, Jun 28, 2016 at 8:24 PM,  <devel.brain99@xoxy.net> wrote:

> The problem:
> Most queries execute fast, however sometimes queries on the job
> table (which contains the job queue) take exactly 122 seconds
> (+ 0-50ms) to execute for no clear reason.

> Have I missed something obvious?

Please monitor for the start of such an event and capture the full
contents of pg_stat_activity and pg_locks during that 2 minute
window.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Random slow queries

От
Igor Neyman
Дата:
-----Original Message-----
From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of
devel.brain99@xoxy.net
Sent: Tuesday, June 28, 2016 9:24 PM
To: pgsql-performance@postgresql.org
Subject: [PERFORM] Random slow queries

Hi,

I have a weird slow query issue I can't seem to find the cause of, so I'm hoping someone here can shed some light on
this.

........................................
........................................


Given the reasonably small dataset (a pg_dump of the full database containing about 2500 jobs is less than 1MB) I would
thinkthat the whole database fits in memory anyway, making this issue all the more puzzling. Have I missed something
obvious?

Best regards,
Roel

______________________________________________________________________________________________________________________

Did you try AUTO_EXPLAIN extension (https://www.postgresql.org/docs/9.3/static/auto-explain.html) for diagnostic
purposes?
With auto_explain.loganalize = true it will log automatically EXPLAIN ANALYZE output, rather than just EXPLAIN output.
Turningthis parameter ON permanently could have negative impact on over-all performance, so use it judiciously.  
 

Regards,
Igor


Re: Random slow queries

От
devel.brain99@xoxy.net
Дата:
On 29 June 2016 at 14:45, Kevin Grittner <kgrittn@gmail.com> wrote:
> Please monitor for the start of such an event and capture the full
> contents of pg_stat_activity and pg_locks during that 2 minute
> window.

I had already looked at that manually and found nothing unusual. To be more
thorough, I now had a batch file log the contents of pg_stat_activity and
pg_locks every 5 seconds.

During my test run, there was one offending query invocation, a simple
SELECT * FROM job WHERE field = $1
Of course the actual query specified the list of fields as it was generated
by Hibernate, but that is what it boils down to - no joins etc. The column on
which was queried is a VARCHAR(64) NOT NULL, not unique nor indexed (though
in practice most values are unique).

I can of course post the full output somewhere if you want (though it's more
than 1000 lines). In the meantime, here is what I can gather from the output
for these 2 minutes:

1) Looking at the logged pg_stat_activity data, there usually is only one
other query executing: the select * from pg_stat_activity itself. Sometimes
there's another query or a connection idle in transaction (which disappears
again in the next output from pg_stat_activity), but given the volume of
queries that's executed this seems expected.

2) Looking at pg_locks, the only locks that are consistently held throughout
those 2 minutes are these 5:
- the locks held by the slow query itself: an AccessShareLock on the job
table, and a virtualxid ExclusiveLock (the query does not happen within a
transaction).
- the advisory lock for the job this thread is processing
- locks held by the SELECT * FROM pg_locks query (a lock on the pg_locks
table and a virtualxid lock)

These 5 locks are of course all granted. Other locks change every 5 seconds,
and often no other locks are held at all.

Best regards,
Roel



Re: Random slow queries

От
devel.brain99@xoxy.net
Дата:
On 29 June 2016 at 16:32, Igor Neyman <ineyman@perceptron.com> wrote:
> Did you try AUTO_EXPLAIN extension
> (https://www.postgresql.org/docs/9.3/static/auto-explain.html) for
> diagnostic purposes?
> With auto_explain.loganalize = true it will log automatically EXPLAIN
> ANALYZE output, rather than just EXPLAIN output.  Turning this parameter ON
> permanently could have negative impact on over-all performance, so use it
> judiciously.
>
> Regards,
> Igor
>

Yes, I tried that. As mentioned in my original email:
> Things I tried:
> ...
> - Enable auto_explain to obtain more information. Unfortunately, for the
> 2-minute queries no plan is logged. If I manually force a query to take a
> long time (eg SELECT pg_sleep(5)) or if I set auto_explain.log_min_duration
> low enough plans are logged for slow queries, _except_ for these 2-minute
> mysteries. This makes it very hard to see exactly why the query is taking so
> long. Is this a bug in auto_explain? Are cached plans never logged (this is
> not indicated in the documentation)? Some other reason why no plan would be
> logged? Note that this happens both in PostgreSQL 9.3.4 and 9.5.3

To be more precise, this is the config I used for auto_explain:
shared_preload_libraries = 'auto_explain'
auto_explain.log_min_duration = 2000
auto_explain.log_analyze = true
auto_explain.log_buffers = true
auto_explain.log_timing = true

On a test query (eg using pg_sleep) there is logging for both the slow query
log and for auto_explain; for my actual problem queries there's only the slow
query log output, no auto_explain query plan.
As you can see from the logs I posted, it appears the execution plan was
cached (LOG:  duration: 122006.000 ms  bind cached-1453392550: select....).
Maybe those aren't processed by auto_explain?

Best regards,
Roel



Re: Random slow queries

От
Tom Lane
Дата:
devel.brain99@xoxy.net writes:
> As you can see from the logs I posted, it appears the execution plan was
> cached (LOG:  duration: 122006.000 ms  bind cached-1453392550: select....).
> Maybe those aren't processed by auto_explain?

In that, "cached-1453392550" is a statement name given by the client;
you'd know better than we do where it's coming from, but it has no
particular significance to the server.

The real information here is that what is taking 122 seconds is the BIND
step of extended query protocol.  That explains why auto_explain doesn't
notice it; auto_explain only instruments the execution phase.  Typically,
what takes time in the BIND step is planning the query, so it seems like
we have to conclude that something in planning is getting hung up.  That
doesn't get us very much closer to an explanation though :-(.

Don't know if it would be practical for you at all, but if you could
attach to a process that's stuck like this with a debugger and get a stack
trace, that would probably be very informative.

https://wiki.postgresql.org/wiki/Generating_a_stack_trace_of_a_PostgreSQL_backend

            regards, tom lane


Re: Random slow queries

От
Jeff Janes
Дата:
On Tue, Jun 28, 2016 at 6:24 PM,  <devel.brain99@xoxy.net> wrote:
>
>
> PostgreSQL version:
> PostgreSQL 9.3.4, compiled by Visual C++ build 1600, 64-bit

The current minor version of that branch is 9.3.13, so you are 9 bug
fix releases behind.

I don't know if this matters, because I see that my first guess of
your problem was fixed in commit 4162a55c77cbb54acb4ac442e, which was
already included in 9.3.4.  (Yes, you did say you also observed the
problem in 9.5.3, but still, why intentionally run something that far
behind?)


> Things I tried:
> - Upgrading to PostgreSQL 9.5.3, compiled by Visual C++ build 1800, 64-bit
> This did not solve the problem, queries still take 122 seconds from time to
> time

Could you try 9.6beta2?

In particular, I am wondering if your problem was solved by

commit 8a7d0701814a4e293efad22091d6f6fb441bbe1c
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   Wed Aug 26 18:18:57 2015 -0400

    Speed up HeapTupleSatisfiesMVCC() by replacing the XID-in-progress test.


I am not entirely sure why this (as opposed to the previous-mentioned
4162a55c77cbb54) would fix a problem occurring during BIND, though.

Cheers,

Jeff


Re: Random slow queries

От
Tom Lane
Дата:
Jeff Janes <jeff.janes@gmail.com> writes:
> On Tue, Jun 28, 2016 at 6:24 PM,  <devel.brain99@xoxy.net> wrote:
>> PostgreSQL 9.3.4, compiled by Visual C++ build 1600, 64-bit

> The current minor version of that branch is 9.3.13, so you are 9 bug
> fix releases behind.

Definitely a fair complaint.

> I don't know if this matters, because I see that my first guess of
> your problem was fixed in commit 4162a55c77cbb54acb4ac442e, which was
> already included in 9.3.4.

That commit could have helped if the problem were simply slow planning.
But I do not see how it explains a *consistent* 122-second delay.
That sounds very much like a timeout expiring someplace, and I have
no idea where.

            regards, tom lane


Re: Random slow queries

От
devel.brain99@xoxy.net
Дата:
On 29 June 2016 at 22:20, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Don't know if it would be practical for you at all, but if you could
> attach to a process that's stuck like this with a debugger and get a stack
> trace, that would probably be very informative.

It seems I have found the cause of my issues: my antivirus software.
When I tried to debug a stuck process, the currently executing code was in
some DLL without debugging information etc. When checking where the DLL came
from, it appeared to be from the security software I had installed.

I had been meaning to change because I have had some performance issues in
the past (nowhere near as bad as this issue though!) but hadn't yet gotten
around to it. After switching this weekend, the issue went away completely.

Even though I had previously noticed some performance issues with my
antivirus, I must say this still is a very weird failure mode - especially
as it still occurred if I disabled all realtime protection (one of the
first things I tried, even if not mentioned in my earlier emails).

Thanks for the time spent trying to help!

Best regards,
Roel



Re: Random slow queries

От
Jim Nasby
Дата:
On 6/29/16 1:01 PM, devel.brain99@xoxy.net wrote:
> During my test run, there was one offending query invocation, a simple
> SELECT * FROM job WHERE field = $1
> Of course the actual query specified the list of fields as it was generated
> by Hibernate, but that is what it boils down to - no joins etc. The column on
> which was queried is a VARCHAR(64) NOT NULL, not unique nor indexed (though
> in practice most values are unique).

Be careful about your assumptions there... SELECT * can have redically
different performance than selecting individual fields. In particular,
if you select something that's been toasted external, that's going to
produce it's own index scan of the toast table, which could then run
into conflicts with vacuuming.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461