The following bug has been logged on the website:
Bug reference: 14918
Logged by: Flavio Gurgel
Email address: fhagur@gmail.com
PostgreSQL version: 9.6.6
Operating system: Debian 8.9 Linux 3.16.0-4-amd64
Description:
Hello
A lock on a relation in a standby server caused a query to stay waiting.
Our in-house application load-balances some non-time-critical read queries
to an asynchronous streaming standby server.
The standby server was running PostgreSQL 9.6.6 and the master server was
running 9.6.5 (today the master server has been updated to 9.6.6).
The query is a very fast single-table SELECT like this (the predicate was
anonymized for non-disclosure; we're not running prepared statements
here):
SELECT search_results.* FROM search_results WHERE search_results.profile_id
= ? ORDER BY array_length(agenda_ids, 1) desc nulls last, practice_id;
The table structure is:
                                             Table "public.search_results"
             Column              |            Type             | Collation | Nullable |                  Default
---------------------------------+-----------------------------+-----------+----------+--------------------------------------------
 id                              | integer                     |           | not null | nextval('search_results_id_seq'::regclass)
 practice_id                     | integer                     |           |          |
 speciality_id                   | integer                     |           |          |
 profile_id                      | integer                     |           | not null |
 agenda_ids                      | character varying(255)[]    |           |          |
 name                            | character varying(255)      |           |          |
 latitude                        | double precision            |           |          |
 longitude                       | double precision            |           |          |
 created_at                      | timestamp without time zone |           |          |
 updated_at                      | timestamp without time zone |           |          |
 searchable                      | boolean                     |           | not null | false
 rank                            | integer                     |           | not null | 0
 zipcode                         | character varying(255)      |           | not null |
 country                         | character varying(2)        |           |          |
 visit_motive_label_ids          | integer[]                   |           | not null | '{}'::integer[]
 external_availabilities_service | boolean                     |           | not null | false
 profile_practice_detail_id      | integer                     |           |          |
 visit_motive_ids                | integer[]                   |           |          | '{}'::integer[]
 name_search_only                | boolean                     |           | not null | false
 top_specialities                | character varying[]         |           | not null | '{}'::character varying[]
Indexes:
    "search_results_pkey" PRIMARY KEY, btree (id)
    "index_search_results_on_practice_id" btree (practice_id)
    "index_search_results_on_profile_id" btree (profile_id)
    "index_search_results_on_profile_practice_detail_id" btree (profile_practice_detail_id)
    "index_search_results_on_speciality_id" btree (speciality_id)
    "tsvector_on_search_results" gin (to_tsvector('doctolib_name_search'::regconfig, COALESCE(name, ''::character varying)::text))
Foreign-key constraints:
    "search_results_practice_id_fk" FOREIGN KEY (practice_id) REFERENCES practices(id)
    "search_results_profile_id_fk" FOREIGN KEY (profile_id) REFERENCES profiles(id)
    "search_results_profile_practice_detail_id_fk" FOREIGN KEY (profile_practice_detail_id) REFERENCES profile_practice_details(id)
    "search_results_speciality_id_fk" FOREIGN KEY (speciality_id) REFERENCES specialities(id)
Several similar queries stacked up until we restarted the postmaster. A
statement_timeout of 7 seconds is set for all queries on the standby.
The blocking process was probably the WAL receiver: we didn't find any
queries from the blocking PID in the logs, we didn't run ps during the
incident to confirm, and we lost track of it with the postmaster restart;
nothing else was logged.
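For what it's worth, a diagnostic query along these lines, run on the standby
while the queries were stuck, should have shown which PID was blocking and
which relation locks were not granted (this is only a sketch, not something we
captured during the incident; pg_blocking_pids() exists since 9.6):

-- Sessions that are blocked, and the PIDs blocking them:
SELECT a.pid,
       pg_blocking_pids(a.pid) AS blocked_by,
       a.wait_event_type,
       a.wait_event,
       a.query
FROM pg_stat_activity a
WHERE cardinality(pg_blocking_pids(a.pid)) > 0;

-- Relation locks that are not granted, with the relation name:
SELECT pid, locktype, relation::regclass AS relation, mode, granted
FROM pg_locks
WHERE NOT granted;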
max_standby_streaming_delay is set to -1. We're used to seeing some
replication lag caused by master-side cleanups on tables related to queries
running on the standby. During this incident, the streaming lag exploded to
several GB; we monitor it from the pg_stat_replication view.
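The lag figure comes from a query roughly like this on the master (a sketch,
not our exact monitoring statement; these are the 9.6 function and column
names, renamed to the *wal*/*lsn* variants in PostgreSQL 10):

-- Replay lag in bytes per standby, as seen from the master:
SELECT application_name,
       state,
       pg_xlog_location_diff(pg_current_xlog_location(), replay_location) AS replay_lag_bytes
FROM pg_stat_replication;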
The only thing we found on the master was an autovacuum that logged this,
exactly at the beginning of the incident:
automatic vacuum of table "doctolib.public.search_results": index scans: 1
pages: 7512 removed, 8138 remain, 0 skipped due to pins, 7801 skipped
frozen
tuples: 0 removed, 264186 remain, 0 are dead but not yet removable
buffer usage: 49032 hits, 35850 misses, 18054 dirtied
avg read rate: 4.068 MB/s, avg write rate: 2.049 MB/s
system usage: CPU 0.62s/1.30u sec elapsed 68.84 sec
This table is fully refreshed every hour with this transaction:
BEGIN;
DELETE FROM search_results;
-- loop here: INSERT INTO search_results ...;
COMMIT;
The insert loop can take tens of minutes, around half an hour; the DELETE
runs in less than a second.
This autovacuum worker seemed to appear just after this refresh operation,
as expected.
No recent changes were made to the table's structure or to the query. We see
these operations all the time in our application, and this is the first time
in months that we have seen this lock behaviour on the standby server. Locks
are also monitored, and relation locks never appeared before.
Since this lock is not expected on a standby server configured this way, we
suspect a possible bug in PostgreSQL.
Many thanks,
Flavio Gurgel