Обсуждение: [BUGS] BUG #14901: Canceled queries missing from pg_stat_statements

Поиск
Список
Период
Сортировка

[BUGS] BUG #14901: Canceled queries missing from pg_stat_statements

От
gregburek@heroku.com
Дата:
The following bug has been logged on the website:

Bug reference:      14901
Logged by:          Greg Burek
Email address:      gregburek@heroku.com
PostgreSQL version: 9.5.10
Operating system:   linux
Description:

Hello,
I recently worked with an app that hit a mysterious wall in Postgres query
performance, where overall query latency had increased dramatically, with
sporadic statement_timeout producing app errors. Examining
pg_stat_statements showed some queries that were consuming resources, which
were then optimized, but the overall query latency and app error rate stayed
constant. 

Only after examining logs did we find a class of query that was being
consistently canceled by statement_timeout=30s. This query was found mostly
by chance as it was a fraction of the total statement_timeout rate, as some
sessions have much more aggressive statement_timeout=2s. Running explain
analyze on this 30s timeout query, we found it to be extremely resource
heavy and ultimately causing the overall degredation of response time.

This was a surprise to the app's developers that a query, which had a large
execution time but was being canceled by a relatively sane
statement_timeout, was missing from pg_stat_statements. That one would need
to access, consume and analyze postgres logs was unexpected, given Postgres
documents
(https://www.postgresql.org/docs/current/static/pgstatstatements.html) that
"The pg_stat_statements module provides a means for tracking execution
statistics of *all* SQL statements executed by a server." That canceled
queries are missing from this view seems like an easy to miss, but vital
exception.

I think this is either a bug in how pgss does not account for canceled
queries that still exert load on a database or in the documentation of pgss
that it only accounts for completed queries and that statement_timeout or
app timeouts can result in query undercounting or omissions. 

Below is a reproduction of this issue, where a pathologically bad query,
which is responsible for the bulk of the load executed on the db, does not
increment or add execution statistics to pg_stat_statements.

I would expect that the final `select * from test order by random() limit
?;` entry of pg_stat_statements would have a count equal 7 and total_time
around 142623.038. 

=# create extension "uuid-ossp";
CREATE EXTENSION
Time: 10.053 ms

=# create extension "pg_stat_statements";
CREATE EXTENSION
Time: 23.607 ms

=# create table test as (   SELECT id as id, uuid_generate_v4() as uuid, uuid_generate_v1() as
name     FROM generate_series(1,(10^8)::int) as id);
SELECT 100000000
Time: 276467.422 ms (04:36.467)

=# select pg_stat_statements_reset();pg_stat_statements_reset
--------------------------

(1 row)

Time: 0.867 ms

=# select query, calls from pg_stat_statements;              query                | calls
------------------------------------+-------select pg_stat_statements_reset(); |     1
(1 row)

Time: 1.221 ms

=# select * from test order by random() limit 1;   id    |                 uuid                 |                 name
----------+--------------------------------------+--------------------------------------79600490 |
46844417-f3ee-4add-b5c5-1ba3e6db5da5| 
e4c118b2-c5a3-11e7-be4b-8c8590305665
(1 row)

Time: 22627.277 ms (00:22.627)

=# select query, calls from pg_stat_statements;                    query                     | calls
-----------------------------------------------+-------select query, calls from pg_stat_statements;  |     1select
pg_stat_statements_reset();           |     1select * from test order by random() limit ?; |     1
 
(3 rows)

Time: 1.439 ms

=# SET statement_timeout='20s';
SET
Time: 8.936 ms

=# select * from test order by random() limit 1;
ERROR:  57014: canceling statement due to statement timeout
LOCATION:  ProcessInterrupts, postgres.c:2971
Time: 20073.747 ms (00:20.074)

=# select * from test order by random() limit 1;
ERROR:  57014: canceling statement due to statement timeout
LOCATION:  ProcessInterrupts, postgres.c:2971
Time: 20000.455 ms (00:20.000)

=# select * from test order by random() limit 1;
ERROR:  57014: canceling statement due to statement timeout
LOCATION:  ProcessInterrupts, postgres.c:2971
Time: 20065.561 ms (00:20.066)

=# select * from test order by random() limit 1;
ERROR:  57014: canceling statement due to statement timeout
LOCATION:  ProcessInterrupts, postgres.c:2971
Time: 20019.753 ms (00:20.020)

=# select * from test order by random() limit 1;
ERROR:  57014: canceling statement due to statement timeout
LOCATION:  ProcessInterrupts, postgres.c:2971
Time: 20056.721 ms (00:20.057)

=# select * from test order by random() limit 1;
ERROR:  57014: canceling statement due to statement timeout
LOCATION:  ProcessInterrupts, postgres.c:2971
Time: 20001.994 ms (00:20.002)

=# select query, calls, total_time from pg_stat_statements;                    query                     | calls |
total_time
-----------------------------------------------+-------+------------select query, calls from pg_stat_statements;  |
2|      2.063SET statement_timeout='20s';                  |     1 |      0.857select pg_stat_statements_reset();
    |     1 |      2.753select * from test order by random() limit ?; |     1 |  22623.038
 
(4 rows)

Time: 2.536 ms


--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] BUG #14901: Canceled queries missing frompg_stat_statements

От
Andres Freund
Дата:
Hi,

On 2017-11-12 22:28:11 +0000, gregburek@heroku.com wrote:
> I think this is either a bug in how pgss does not account for canceled
> queries that still exert load on a database or in the documentation of pgss
> that it only accounts for completed queries and that statement_timeout or
> app timeouts can result in query undercounting or omissions. 

I think the only sane way out of this one is to change the docs. You
could also add some additional tracking feature, but:


> Below is a reproduction of this issue, where a pathologically bad query,
> which is responsible for the bulk of the load executed on the db, does not
> increment or add execution statistics to pg_stat_statements.
> 
> I would expect that the final `select * from test order by random() limit
> ?;` entry of pg_stat_statements would have a count equal 7 and total_time
> around 142623.038. 

this'd be a horrible idea, because it'd completely skew the cost that
one expects a single query to have. Imagine that one session manually
sets a timeout to 1ms and executes a query normally taking minutes a few
thousand times. The stats wouldn't make a lick of sense anymore.


One way to address this at least partially would be to add a counter for
initiated queries and a counter for completed queries (the current
one). If all the rest of the counters are left the same the issue from
above isn't there, and you'd not even need an exception handler to
update counters.

- Andres


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] BUG #14901: Canceled queries missing from pg_stat_statements

От
Tom Lane
Дата:
Andres Freund <andres@anarazel.de> writes:
> On 2017-11-12 22:28:11 +0000, gregburek@heroku.com wrote:
>> [ canceled queries aren't accounted for by pg_stat_statements ]

> One way to address this at least partially would be to add a counter for
> initiated queries and a counter for completed queries (the current
> one). If all the rest of the counters are left the same the issue from
> above isn't there, and you'd not even need an exception handler to
> update counters.

That seems like an idea well worth looking into, seeing that there's
already a code path that inserts the entry into the shared table before
execution starts.

This would not account for queries that fail at parse or plan time,
but hopefully those generally don't run long enough to be a problem.
        regards, tom lane


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs