Discussion: circular wait not triggering deadlock ?
Running pg10.2, I have a handful of maintenance jobs run in the middle of the
night, which appear to have gotten stuck waiting on each other...

 . reindex - loop around pg_repack, or psql -c "REINDEX $table" for those
   which can't be repacked
 . vacuum - mostly runs ANALYZE, like on empty parent tables

I believe I've come across something similar before, but hadn't previously
stopped to dig into it.

ts=# SELECT * FROM pg_stat_activity WHERE pid=20488;
-[ RECORD 1 ]----+------------------------------------------------------------
datid            | 16402
datname          | ts
pid              | 20488
usesysid         | 10
usename          | postgres
application_name | pg_repack
client_addr      |
client_hostname  |
client_port      | -1
backend_start    | 2018-03-07 01:00:15.72433-07
xact_start       | 2018-03-07 01:00:15.747619-07
query_start      | 2018-03-07 01:00:15.747619-07
state_change     | 2018-03-07 01:00:15.74762-07
wait_event_type  | Lock
wait_event       | virtualxid
state            | active
backend_xid      |
backend_xmin     |
query            | CREATE INDEX CONCURRENTLY index_1113966210 ON eric_enodeb_cell_20180304 USING btree (site_id) WITH (fillfactor='100') TABLESPACE oldindex
backend_type     | client backend

ts=# SELECT * FROM pg_stat_activity WHERE pid=530;
-[ RECORD 1 ]----+------------------------------------------------------------
datid            | 16402
datname          | ts
pid              | 530
usesysid         | 10
usename          | postgres
application_name | psql
client_addr      |
client_hostname  |
client_port      | -1
backend_start    | 2018-03-07 02:11:59.00832-07
xact_start       | 2018-03-07 02:11:59.009855-07
query_start      | 2018-03-07 02:11:59.009855-07
state_change     | 2018-03-07 02:11:59.009861-07
wait_event_type  | Lock
wait_event       | relation
state            | active
backend_xid      |
backend_xmin     | 67578467
query            | ALTER TABLE eric_enodeb_cell_metrics* ALTER start_time SET STATISTICS 400
backend_type     | client backend

ts=# SELECT relation::regclass, granted, * FROM pg_locks
     WHERE (pid=530 AND NOT granted) OR relation='eric_enodeb_cell_20180304'::regclass
     -- virtualxid='16/38159763' OR relation='eric_enodeb_cell_metrics'::regclass
     ORDER BY 1,2,3;
         relation          | granted | locktype | database |  relation  | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction |  pid  |           mode           | granted | fastpath
---------------------------+---------+----------+----------+------------+------+-------+------------+---------------+---------+-------+----------+--------------------+-------+--------------------------+---------+----------
 eric_enodeb_cell_20180304 | t       | relation | 16402    | 1113966172 |      |       |            |               |         |       |          | 19/28401734        | 20488 | ShareUpdateExclusiveLock | t       | f
 eric_enodeb_cell_20180304 | f       | relation | 16402    | 1113966172 |      |       |            |               |         |       |          | 16/38159763        | 530   | ShareUpdateExclusiveLock | f       | f

I see PID 530 has ShareUpdateExclusiveLock on many child tables... but NOT on
eric_enodeb_cell_20180304, which PID 20488 is trying to reindex.

ts=# SELECT granted, relation::regclass, * FROM pg_locks
     WHERE pid=530 OR relation='eric_enodeb_cell_metrics'::regclass
     ORDER BY 1, 2, 3;
 granted |         relation          | locktype | database |  relation  | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction |  pid  |           mode           | granted | fastpath
---------+---------------------------+----------+----------+------------+------+-------+------------+---------------+---------+-------+----------+--------------------+-------+--------------------------+---------+----------
 f       | eric_enodeb_cell_metrics  | relation | 16402    | 948644697  |      |       |            |               |         |       |          | 32/838769          | 18941 | AccessShareLock          | f       | f
 f       | eric_enodeb_cell_metrics  | relation | 16402    | 948644697  |      |       |            |               |         |       |          | 18/38156962        | 16592 | AccessExclusiveLock      | f       | f
 f       | eric_enodeb_cell_metrics  | relation | 16402    | 948644697  |      |       |            |               |         |       |          | 31/1337307         | 22961 | AccessShareLock          | f       | f
 f       | eric_enodeb_cell_metrics  | relation | 16402    | 948644697  |      |       |            |               |         |       |          | 27/1755026         | 17248 | AccessShareLock          | f       | f
 f       | eric_enodeb_cell_metrics  | relation | 16402    | 948644697  |      |       |            |               |         |       |          | 28/1118276         | 28357 | AccessShareLock          | f       | f
 f       | eric_enodeb_cell_metrics  | relation | 16402    | 948644697  |      |       |            |               |         |       |          | 29/830016          | 21846 | AccessExclusiveLock      | f       | f
 f       | eric_enodeb_cell_20180304 | relation | 16402    | 1113966172 |      |       |            |               |         |       |          | 16/38159763        | 530   | ShareUpdateExclusiveLock | f       | f
 t       | eric_enodeb_cell_201609   | relation | 16402    | 193318232  |      |       |            |               |         |       |          | 16/38159763        | 530   | ShareUpdateExclusiveLock | t       | f
 t       | eric_enodeb_cell_201610   | relation | 16402    | 193318241  |      |       |            |               |         |       |          | 16/38159763        | 530   | ShareUpdateExclusiveLock | t       | f
 t       | eric_enodeb_cell_201611   | relation | 16402    | 206004215  |      |       |            |               |         |       |          | 16/38159763        | 530   | ShareUpdateExclusiveLock | t       | f
 t       | eric_enodeb_cell_201612   | relation | 16402    | 232980573  |      |       |            |               |         |       |          | 16/38159763        | 530   | ShareUpdateExclusiveLock | t       | f
 t       | eric_enodeb_cell_201701   | relation | 16402    | 259721731  |      |       |            |               |         |       |          | 16/38159763        | 530   | ShareUpdateExclusiveLock | t       | f
 t       | eric_enodeb_cell_201702   | relation | 16402    | 291841516  |      |       |            |               |         |       |          | 16/38159763        | 530   | ShareUpdateExclusiveLock | t       | f
 t       | eric_enodeb_cell_20170301 | relation | 16402    | 384321707  |      |       |            |               |         |       |          | 16/38159763        | 530   | ShareUpdateExclusiveLock | t       | f
 t       | eric_enodeb_cell_20170302 | relation | 16402    | 385254657  |      |       |            |               |         |       |          | 16/38159763        | 530   | ShareUpdateExclusiveLock | t       | f
 t       | eric_enodeb_cell_20170303 | relation | 16402    | 386180686  |      |       |            |               |         |       |          | 16/38159763        | 530   | ShareUpdateExclusiveLock | t       | f
...

ts=# SELECT granted, relation::regclass, * FROM pg_locks WHERE pid=20488 ORDER BY 1,2,3;
-[ RECORD 1 ]------+--------------------------
granted            | f
locktype           | virtualxid
virtualxid         | 22/4317099
virtualtransaction | 19/28401734
pid                | 20488
mode               | ShareLock
granted            | f
fastpath           | f

ts=# SELECT granted, pid, mode, virtualtransaction, virtualxid FROM pg_locks WHERE virtualxid='22/4317099' ORDER BY 1,2,3;
 granted |  pid  |     mode      | virtualtransaction | virtualxid
---------+-------+---------------+--------------------+------------
 f       | 20488 | ShareLock     | 19/28401734        | 22/4317099
 t       |  6471 | ExclusiveLock | 22/4317099         | 22/4317099

[pryzbyj@database ~]$ ps -O lstart -C reindex,pg_repack
  PID                  STARTED S TTY          TIME COMMAND
20038 Wed Mar  7 01:00:00 2018 S ?        00:00:00 /bin/sh /usr/local/sbin/reindex
20484 Wed Mar  7 01:00:14 2018 S ?        00:00:00 /bin/sh /usr/local/sbin/reindex
20485 Wed Mar  7 01:00:14 2018 S ?        00:00:00 /usr/pgsql-10/bin/pg_repack -E warning --no-kill-backend -d ts -i eric_enodeb_cell_20180304_site_idx -Ss oldindex

Note, I believe last night our backup job would've run for longer and
processed (many) more tables than usual, looping around pg_dump --snapshot.

Is it a bug that this isn't caught by a deadlock detector and cancelled?

Thanks,
Justin
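For reference, the same wait edges can be pulled out in one query with
pg_blocking_pids() (available since 9.6), rather than by eyeballing pg_locks;
a minimal sketch, run against a live cluster:

```sql
-- Who is blocking whom, according to the lock manager's own waits-for graph.
-- pg_blocking_pids() reports only edges PostgreSQL itself can see; waits that
-- happen inside client applications (e.g. a script sleeping until pg_dump
-- finishes) do not appear here.
SELECT a.pid,
       a.state,
       a.application_name,
       pg_blocking_pids(a.pid) AS blocked_by,
       left(a.query, 60)       AS query
FROM pg_stat_activity a
WHERE cardinality(pg_blocking_pids(a.pid)) > 0;
```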
Justin Pryzby <pryzby@telsasoft.com> writes:
> Running pg10.2, I have a handful of maintenance jobs run in the middle of the
> night, which appear to have gotten stuck waiting on each other..

> ts=# SELECT granted, pid, mode, virtualtransaction, virtualxid FROM pg_locks WHERE virtualxid='22/4317099' ORDER BY 1,2,3;
>  granted |  pid  |     mode      | virtualtransaction | virtualxid
> ---------+-------+---------------+--------------------+------------
>  f       | 20488 | ShareLock     | 19/28401734        | 22/4317099
>  t       |  6471 | ExclusiveLock | 22/4317099         | 22/4317099

PID 20488 is evidently waiting for PID 6471 to finish its transaction.
What's that one doing?

> Is it a bug that this isn't caught by a deadlock detector and cancelled?

We did find a bug recently with concurrent CREATE INDEX CONCURRENTLY's
all waiting for each other to commit.

			regards, tom lane
On Thu, Mar 08, 2018 at 01:57:06PM -0500, Tom Lane wrote:
> Justin Pryzby <pryzby@telsasoft.com> writes:
> > Running pg10.2, I have a handful of maintenance jobs run in the middle of the
> > night, which appear to have gotten stuck waiting on each other..
>
> > ts=# SELECT granted, pid, mode, virtualtransaction, virtualxid FROM pg_locks WHERE virtualxid='22/4317099' ORDER BY 1,2,3;
> >  granted |  pid  |     mode      | virtualtransaction | virtualxid
> > ---------+-------+---------------+--------------------+------------
> >  f       | 20488 | ShareLock     | 19/28401734        | 22/4317099
> >  t       |  6471 | ExclusiveLock | 22/4317099         | 22/4317099
>
> PID 20488 is evidently waiting for PID 6471 to finish its transaction.
> What's that one doing?

Um, I thought I had kept track of all two pids but looks not..

query            | SELECT pg_export_snapshot();

So that explains that. I already killed one proc, but..

ts=# SELECT granted, relation::regclass, pid, mode, virtualtransaction, virtualxid FROM pg_locks WHERE pid=17248 ORDER BY 1;
 granted |         relation         |  pid  |      mode       | virtualtransaction | virtualxid
---------+--------------------------+-------+-----------------+--------------------+------------
 f       | eric_enodeb_cell_metrics | 17248 | AccessShareLock | 27/1755026         |

ts=# SELECT granted, relation::regclass, pid, mode, virtualtransaction, virtualxid FROM pg_locks
     WHERE relation='eric_enodeb_cell_metrics'::regclass ORDER BY 1;  -- pid=17248 virtualxid='22/4317099' ORDER BY 1,2,3;
 granted |         relation         |  pid  |           mode           | virtualtransaction | virtualxid
---------+--------------------------+-------+--------------------------+--------------------+------------
 f       | eric_enodeb_cell_metrics | 22961 | AccessShareLock          | 31/1337307         |
 f       | eric_enodeb_cell_metrics | 17248 | AccessShareLock          | 27/1755026         |
 f       | eric_enodeb_cell_metrics | 28357 | AccessShareLock          | 28/1118276         |
 f       | eric_enodeb_cell_metrics | 21846 | AccessExclusiveLock      | 29/830016          |
 f       | eric_enodeb_cell_metrics | 16592 | AccessExclusiveLock      | 18/38156962        |
 f       | eric_enodeb_cell_metrics | 18941 | AccessShareLock          | 32/838769          |
 t       | eric_enodeb_cell_metrics | 530   | ShareUpdateExclusiveLock | 16/38159763        |
(7 rows)

ts=# SELECT pid, state, left(query,199) FROM pg_stat_activity WHERE pid IN (530,17248);
   530 | active | ALTER TABLE eric_enodeb_cell_metrics* ALTER start_time SET STATISTICS 400
 17248 | active | SELECT c.tableoid, c.oid, c.relname, (SELECT pg_catalog.array_agg(acl ORDER BY row_n) FROM (SELECT acl, row_n FROM pg_catalog.unnest(coalesce(c.relacl,pg_catalog.acldefault( CASE WHEN c.relkind = 'S'

ts=# SELECT granted, pid, mode, virtualtransaction, virtualxid FROM pg_locks WHERE relation='eric_enodeb_cell_20180304'::regclass;
 granted |  pid  |           mode           | virtualtransaction | virtualxid
---------+-------+--------------------------+--------------------+------------
 t       | 20488 | ShareUpdateExclusiveLock | 19/28401734        |
 f       |   530 | ShareUpdateExclusiveLock | 16/38159763        |

ts=# SELECT pid, state, left(query,199) FROM pg_stat_activity WHERE pid=20488;
  pid  | state  |                                                                    left
-------+--------+---------------------------------------------------------------------------------------------------------------------------------------------
 20488 | active | CREATE INDEX CONCURRENTLY index_1113966210 ON eric_enodeb_cell_20180304 USING btree (site_id) WITH (fillfactor='100') TABLESPACE oldindex

ts=# SELECT granted, pid, relation::regclass, locktype, mode virtualxid, virtualxid, virtualtransaction FROM pg_locks WHERE pid=20488 ORDER BY 1;
 granted |  pid  | relation |  locktype  | virtualxid | virtualxid | virtualtransaction
---------+-------+----------+------------+------------+------------+--------------------
 f       | 20488 |          | virtualxid | ShareLock  | 22/4317099 | 19/28401734

ts=# SELECT granted, pid, relation::regclass, locktype, mode virtualxid, virtualtransaction FROM pg_locks WHERE virtualxid='22/4317099' ORDER BY 1;
 granted |  pid  | relation |  locktype  |  virtualxid   | virtualtransaction
---------+-------+----------+------------+---------------+--------------------
 f       | 20488 |          | virtualxid | ShareLock     | 19/28401734
 t       |  6471 |          | virtualxid | ExclusiveLock | 22/4317099
(2 rows)

So... I gather pg_dump is waiting behind the ALTER SET STATS, which is waiting
on CREATE INDEX, which is waiting on the SELECT pg_export_snapshot()
transaction; which I take to mean that I should avoid running pg_repack until
the backup is finished..

ts=# SELECT pid, state, application_name app, left(query,99) FROM pg_stat_activity ORDER BY xact_start LIMIT 9;
  pid  |        state        |    app    | left
-------+---------------------+-----------+-----------------------------------------------------------------------------------------------------
  6471 | idle in transaction | psql      | SELECT pg_export_snapshot();
 20488 | active              | pg_repack | CREATE INDEX CONCURRENTLY index_1113966210 ON eric_enodeb_cell_20180304 USING btree (site_id) WITH
   530 | active              | psql      | ALTER TABLE eric_enodeb_cell_metrics* ALTER start_time SET STATISTICS 400
 17248 | active              | pg_dump   | SELECT c.tableoid, c.oid, c.relname, (SELECT pg_catalog.array_agg(acl ORDER BY row_n) FROM (SELECT

Justin
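The hop-by-hop chain above can also be walked mechanically. A hypothetical
helper query (not from the original session; uses pg_blocking_pids(), 9.6+,
and starts from pid 530 only as an example):

```sql
-- Follow the blocking chain upward from one waiting backend.
WITH RECURSIVE chain(pid, blockers, depth) AS (
    SELECT 530, pg_blocking_pids(530), 0
  UNION ALL
    SELECT b.b, pg_blocking_pids(b.b), c.depth + 1
    FROM chain c, LATERAL unnest(c.blockers) AS b(b)
    WHERE c.depth < 10          -- guard against pathological graphs
)
SELECT * FROM chain;
```

The chain dead-ends at 6471 because its "wait" (for the pg_dump children)
happens client-side and is invisible to the lock manager.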
Justin Pryzby <pryzby@telsasoft.com> writes:
> On Thu, Mar 08, 2018 at 01:57:06PM -0500, Tom Lane wrote:
>> PID 20488 is evidently waiting for PID 6471 to finish its transaction.
>> What's that one doing?

> Um, I thought I had kept track of all two pids but looks not..
> query            | SELECT pg_export_snapshot();

pg_export_snapshot doesn't block for other transactions, though.
Further down, you have output that confirms that:

> 6471 | idle in transaction | psql | SELECT pg_export_snapshot();

That SELECT is the most recently completed command, not the current one.
So now the question is what the connected application is waiting for
before committing.

			regards, tom lane
On Thu, Mar 08, 2018 at 03:05:36PM -0500, Tom Lane wrote:
> Justin Pryzby <pryzby@telsasoft.com> writes:
> > On Thu, Mar 08, 2018 at 01:57:06PM -0500, Tom Lane wrote:
> >> PID 20488 is evidently waiting for PID 6471 to finish its transaction.
> >> What's that one doing?
>
> > Um, I thought I had kept track of all two pids but looks not..
> > query            | SELECT pg_export_snapshot();
>
> pg_export_snapshot doesn't block for other transactions, though.
> Further down, you have output that confirms that:
>
> > 6471 | idle in transaction | psql | SELECT pg_export_snapshot();
>
> That SELECT is the most recently completed command, not the current one.
> So now the question is what the connected application is waiting for
> before committing.

I believe it does an idle loop around sleep(1), until all the pg_dump
--snapshot children are done.

Justin
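That last detail is why the deadlock detector stays quiet: it searches for a
cycle in the waits-for graph, and the edge from the snapshot-holding backend
back to the pg_dump workers exists only in the backup script's sleep loop,
never in the lock table. A toy illustration (pids reused from this thread;
the pg_dump -> ALTER edge is simplified to a direct edge, though in reality it
runs through the queued AccessExclusiveLock requests):

```python
def has_cycle(graph):
    """DFS cycle detection over a waits-for graph {waiter: [holders]}."""
    visiting, done = set(), set()

    def dfs(n):
        if n in visiting:          # back edge -> cycle
            return True
        if n in done:
            return False
        visiting.add(n)
        if any(dfs(m) for m in graph.get(n, [])):
            return True
        visiting.remove(n)
        done.add(n)
        return False

    return any(dfs(n) for n in list(graph))

# Edges the lock manager can actually see (from the pg_locks output above):
edges = {
    530:   [20488],   # ALTER waits on CIC's ShareUpdateExclusiveLock
    17248: [530],     # pg_dump worker queued behind the ALTER (simplified)
    20488: [6471],    # CIC waits on the snapshot transaction's virtualxid
}
assert not has_cycle(edges)    # acyclic -> nothing for the detector to kill

# Add the edge that lives only in the backup script's sleep(1) loop:
edges[6471] = [17248]          # snapshot holder waits for pg_dump children
assert has_cycle(edges)        # now it is a cycle, but only in userspace
```

So the stuck state is a genuine circular wait, just one whose closing edge is
outside PostgreSQL's view.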