Discussion: circular wait not triggering deadlock?
Running pg10.2, I have a handful of maintenance jobs run in the middle of the
night, which appear to have gotten stuck waiting on each other..
. reindex - loop around pg_repack (or psql -c REINDEX $table for those which can't be repacked; see the sketch below)
. vacuum - mostly runs ANALYZE, like on empty parent tables
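For context, the psql -c REINDEX fallback in that job is essentially a loop over statements that could be generated with a query along these lines; this is only a sketch, not the actual script, and the schema filter is made up:

-- sketch only: generate the per-table REINDEX statements the nightly loop falls back to
-- when a table can't be repacked; the schema name here is illustrative
SELECT format('REINDEX TABLE %I.%I;', schemaname, tablename)
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY 1;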
I believe I've come across something similar before, but hadn't previously
stopped to dig into it.
ts=# SELECT * FROM pg_stat_activity WHERE pid=20488;
-[ RECORD 1 ]----+------------------------------------------------------------------------------------------------------------------------------------------
datid            | 16402
datname          | ts
pid              | 20488
usesysid         | 10
usename          | postgres
application_name | pg_repack
client_addr      |
client_hostname  |
client_port      | -1
backend_start    | 2018-03-07 01:00:15.72433-07
xact_start       | 2018-03-07 01:00:15.747619-07
query_start      | 2018-03-07 01:00:15.747619-07
state_change     | 2018-03-07 01:00:15.74762-07
wait_event_type  | Lock
wait_event       | virtualxid
state            | active
backend_xid      |
backend_xmin     |
query            | CREATE INDEX CONCURRENTLY index_1113966210 ON eric_enodeb_cell_20180304 USING btree (site_id) WITH (fillfactor='100') TABLESPACE oldindex
backend_type     | client backend
ts=# SELECT * FROM pg_stat_activity WHERE pid=530;
-[ RECORD 1 ]----+--------------------------------------------------------------------------
datid            | 16402
datname          | ts
pid              | 530
usesysid         | 10
usename          | postgres
application_name | psql
client_addr      |
client_hostname  |
client_port      | -1
backend_start    | 2018-03-07 02:11:59.00832-07
xact_start       | 2018-03-07 02:11:59.009855-07
query_start      | 2018-03-07 02:11:59.009855-07
state_change     | 2018-03-07 02:11:59.009861-07
wait_event_type  | Lock
wait_event       | relation
state            | active
backend_xid      |
backend_xmin     | 67578467
query            | ALTER TABLE eric_enodeb_cell_metrics* ALTER start_time SET STATISTICS 400
backend_type     | client backend
ts=# SELECT relation::regclass, granted, * FROM pg_locks WHERE (pid=530 AND NOT granted) OR relation='eric_enodeb_cell_20180304'::regclass; -- virtualxid='16/38159763' OR relation='eric_enodeb_cell_metrics'::regclass ORDER BY 1,2,3;
         relation          | granted | locktype | database |  relation  | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction |  pid  |           mode           | granted | fastpath
---------------------------+---------+----------+----------+------------+------+-------+------------+---------------+---------+-------+----------+--------------------+-------+--------------------------+---------+----------
 eric_enodeb_cell_20180304 | t       | relation |    16402 | 1113966172 |      |       |            |               |         |       |          | 19/28401734        | 20488 | ShareUpdateExclusiveLock | t       | f
 eric_enodeb_cell_20180304 | f       | relation |    16402 | 1113966172 |      |       |            |               |         |       |          | 16/38159763        |   530 | ShareUpdateExclusiveLock | f       | f
I see PID 530 has ShareUpdateExclusiveLock on many child tables..
..but NOT on eric_enodeb_cell_20180304, which PID 20488 is trying to reindex.
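To line the lock queue up with the sessions behind it, a join along these lines (nothing beyond pg_locks and pg_stat_activity; a sketch only) is handy:

-- sketch: who holds and who is queued on locks for one relation
SELECT l.granted, l.pid, l.mode, a.state, left(a.query, 60) AS query
FROM pg_locks l
JOIN pg_stat_activity a USING (pid)
WHERE l.relation = 'eric_enodeb_cell_20180304'::regclass
ORDER BY l.granted DESC, l.pid;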
ts=# SELECT granted, relation::regclass, * FROM pg_locks WHERE pid=530 OR relation='eric_enodeb_cell_metrics'::regclass ORDER BY 1, 2, 3;
 granted |         relation          |  locktype  | database |  relation  | page | tuple | virtualxid  | transactionid | classid | objid | objsubid | virtualtransaction |  pid  |           mode           | granted | fastpath
---------+---------------------------+------------+----------+------------+------+-------+-------------+---------------+---------+-------+----------+--------------------+-------+--------------------------+---------+----------
 f       | eric_enodeb_cell_metrics  | relation   |    16402 |  948644697 |      |       |             |               |         |       |          | 32/838769          | 18941 | AccessShareLock          | f       | f
 f       | eric_enodeb_cell_metrics  | relation   |    16402 |  948644697 |      |       |             |               |         |       |          | 18/38156962        | 16592 | AccessExclusiveLock      | f       | f
 f       | eric_enodeb_cell_metrics  | relation   |    16402 |  948644697 |      |       |             |               |         |       |          | 31/1337307         | 22961 | AccessShareLock          | f       | f
 f       | eric_enodeb_cell_metrics  | relation   |    16402 |  948644697 |      |       |             |               |         |       |          | 27/1755026         | 17248 | AccessShareLock          | f       | f
 f       | eric_enodeb_cell_metrics  | relation   |    16402 |  948644697 |      |       |             |               |         |       |          | 28/1118276         | 28357 | AccessShareLock          | f       | f
 f       | eric_enodeb_cell_metrics  | relation   |    16402 |  948644697 |      |       |             |               |         |       |          | 29/830016          | 21846 | AccessExclusiveLock      | f       | f
 f       | eric_enodeb_cell_20180304 | relation   |    16402 | 1113966172 |      |       |             |               |         |       |          | 16/38159763        |   530 | ShareUpdateExclusiveLock | f       | f
 t       | eric_enodeb_cell_201609   | relation   |    16402 |  193318232 |      |       |             |               |         |       |          | 16/38159763        |   530 | ShareUpdateExclusiveLock | t       | f
 t       | eric_enodeb_cell_201610   | relation   |    16402 |  193318241 |      |       |             |               |         |       |          | 16/38159763        |   530 | ShareUpdateExclusiveLock | t       | f
 t       | eric_enodeb_cell_201611   | relation   |    16402 |  206004215 |      |       |             |               |         |       |          | 16/38159763        |   530 | ShareUpdateExclusiveLock | t       | f
 t       | eric_enodeb_cell_201612   | relation   |    16402 |  232980573 |      |       |             |               |         |       |          | 16/38159763        |   530 | ShareUpdateExclusiveLock | t       | f
 t       | eric_enodeb_cell_201701   | relation   |    16402 |  259721731 |      |       |             |               |         |       |          | 16/38159763        |   530 | ShareUpdateExclusiveLock | t       | f
 t       | eric_enodeb_cell_201702   | relation   |    16402 |  291841516 |      |       |             |               |         |       |          | 16/38159763        |   530 | ShareUpdateExclusiveLock | t       | f
 t       | eric_enodeb_cell_20170301 | relation   |    16402 |  384321707 |      |       |             |               |         |       |          | 16/38159763        |   530 | ShareUpdateExclusiveLock | t       | f
 t       | eric_enodeb_cell_20170302 | relation   |    16402 |  385254657 |      |       |             |               |         |       |          | 16/38159763        |   530 | ShareUpdateExclusiveLock | t       | f
 t       | eric_enodeb_cell_20170303 | relation   |    16402 |  386180686 |      |       |             |               |         |       |          | 16/38159763        |   530 | ShareUpdateExclusiveLock | t       | f
...
ts=# SELECT granted, relation::regclass, * FROM pg_locks WHERE pid=20488 ORDER BY 1,2,3;
-[ RECORD 1 ]------+--------------------------
granted            | f
locktype           | virtualxid
virtualxid         | 22/4317099
virtualtransaction | 19/28401734
pid                | 20488
mode               | ShareLock
granted            | f
fastpath           | f
ts=# SELECT granted, pid, mode, virtualtransaction, virtualxid FROM pg_locks WHERE virtualxid='22/4317099' ORDER BY 1,2,3;
granted | pid | mode | virtualtransaction | virtualxid
---------+-------+---------------+--------------------+------------
f | 20488 | ShareLock | 19/28401734 | 22/4317099
t | 6471 | ExclusiveLock | 22/4317099 | 22/4317099
[pryzbyj@database ~]$ ps -O lstart -C reindex,pg_repack
PID STARTED S TTY TIME COMMAND
20038 Wed Mar 7 01:00:00 2018 S ? 00:00:00 /bin/sh /usr/local/sbin/reindex
20484 Wed Mar 7 01:00:14 2018 S ? 00:00:00 /bin/sh /usr/local/sbin/reindex
20485 Wed Mar 7 01:00:14 2018 S ? 00:00:00 /usr/pgsql-10/bin/pg_repack -E warning --no-kill-backend -d ts -i eric_enodeb_cell_20180304_site_idx -Ss oldindex
Note, I believe last night our backup job would've run for longer and processed
(many) more tables than usual, looping around pg_dump --snapshot.
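For reference, the backup job follows the usual exported-snapshot pattern, roughly like this (a sketch, not the exact script):

-- coordinator session: hold a transaction open and export a snapshot,
-- then sit idle while the per-table pg_dump children run against it
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT pg_export_snapshot();  -- returns the snapshot id to hand to the children
-- each child runs, from the shell:  pg_dump --snapshot=<that id> -t <table> ...
-- only once the last child finishes does the coordinator COMMIT, ending its
-- transaction (and releasing its virtualxid)
COMMIT;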
Is it a bug that this isn't caught by a deadlock detector and cancelled?
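For what it's worth, pg_blocking_pids() in 9.6+ shows the waits-for edges the deadlock detector considers; a sketch of the kind of query I mean:

-- sketch: which backends are waiting on heavyweight locks, and on whom
SELECT pid, pg_blocking_pids(pid) AS blocked_by,
       wait_event_type, wait_event, left(query, 60) AS query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0
ORDER BY pid;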
Thanks,
Justin
Justin Pryzby <pryzby@telsasoft.com> writes:
> Running pg10.2, I have a handful of maintenance jobs run in the middle of the
> night, which appear to have gotten stuck waiting on each other..
> ts=# SELECT granted, pid, mode, virtualtransaction, virtualxid FROM pg_locks WHERE virtualxid='22/4317099' ORDER BY 1,2,3;
> granted | pid | mode | virtualtransaction | virtualxid
> ---------+-------+---------------+--------------------+------------
> f | 20488 | ShareLock | 19/28401734 | 22/4317099
> t | 6471 | ExclusiveLock | 22/4317099 | 22/4317099
PID 20488 is evidently waiting for PID 6471 to finish its transaction.
What's that one doing?
> Is it a bug that this isn't caught by a deadlock detector and cancelled?
We did find a bug recently with concurrent CREATE INDEX CONCURRENTLY's
all waiting for each other to commit.
regards, tom lane
On Thu, Mar 08, 2018 at 01:57:06PM -0500, Tom Lane wrote:
> Justin Pryzby <pryzby@telsasoft.com> writes:
> > Running pg10.2, I have a handful of maintenance jobs run in the middle of the
> > night, which appear to have gotten stuck waiting on each other..
>
> > ts=# SELECT granted, pid, mode, virtualtransaction, virtualxid FROM pg_locks WHERE virtualxid='22/4317099' ORDER BY 1,2,3;
> > granted | pid | mode | virtualtransaction | virtualxid
> > ---------+-------+---------------+--------------------+------------
> > f | 20488 | ShareLock | 19/28401734 | 22/4317099
> > t | 6471 | ExclusiveLock | 22/4317099 | 22/4317099
>
> PID 20488 is evidently waiting for PID 6471 to finish its transaction.
> What's that one doing?
Um, I thought I had kept track of both pids, but apparently not..
query | SELECT pg_export_snapshot();
So that explains that.
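For the record, tying a virtualxid back to the session that owns it is just another pg_locks/pg_stat_activity join; something like this (a sketch) would have shown it immediately:

-- sketch: find the session holding a given virtualxid
SELECT a.pid, a.application_name, a.state, a.xact_start, left(a.query, 60) AS query
FROM pg_locks l
JOIN pg_stat_activity a USING (pid)
WHERE l.locktype = 'virtualxid'
  AND l.virtualxid = '22/4317099'
  AND l.granted;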
I already killed one proc, but..
ts=# SELECT granted, relation::regclass, pid, mode, virtualtransaction, virtualxid FROM pg_locks WHERE pid=17248 ORDER BY 1;
 granted |                relation                 |  pid  |      mode       | virtualtransaction | virtualxid
---------+-----------------------------------------+-------+-----------------+--------------------+------------
 f       | eric_enodeb_cell_metrics                | 17248 | AccessShareLock | 27/1755026         |
ts=# SELECT granted, relation::regclass, pid, mode, virtualtransaction, virtualxid FROM pg_locks WHERE relation='eric_enodeb_cell_metrics'::regclass ORDER BY 1; --pid=17248 virtualxid='22/4317099' ORDER BY 1,2,3;
 granted |         relation         |  pid  |           mode           | virtualtransaction | virtualxid
---------+--------------------------+-------+--------------------------+--------------------+------------
 f       | eric_enodeb_cell_metrics | 22961 | AccessShareLock          | 31/1337307         |
 f       | eric_enodeb_cell_metrics | 17248 | AccessShareLock          | 27/1755026         |
 f       | eric_enodeb_cell_metrics | 28357 | AccessShareLock          | 28/1118276         |
 f       | eric_enodeb_cell_metrics | 21846 | AccessExclusiveLock      | 29/830016          |
 f       | eric_enodeb_cell_metrics | 16592 | AccessExclusiveLock      | 18/38156962        |
 f       | eric_enodeb_cell_metrics | 18941 | AccessShareLock          | 32/838769          |
 t       | eric_enodeb_cell_metrics |   530 | ShareUpdateExclusiveLock | 16/38159763        |
(7 rows)
ts=# SELECT pid, state, left(query,199) FROM pg_stat_activity WHERE pid IN(530,17248);
   530 | active | ALTER TABLE eric_enodeb_cell_metrics* ALTER start_time SET STATISTICS 400
 17248 | active | SELECT c.tableoid, c.oid, c.relname, (SELECT pg_catalog.array_agg(acl ORDER BY row_n) FROM (SELECT acl,row_n FROM pg_catalog.unnest(coalesce(c.relacl,pg_catalog.acldefault(
                  CASE WHEN c.relkind = 'S'
ts=# SELECT granted, pid, mode, virtualtransaction, virtualxid FROM pg_locks WHERE relation='eric_enodeb_cell_20180304'::regclass;
 granted |  pid  |           mode           | virtualtransaction | virtualxid
---------+-------+--------------------------+--------------------+------------
 t       | 20488 | ShareUpdateExclusiveLock | 19/28401734        |
 f       |   530 | ShareUpdateExclusiveLock | 16/38159763        |
ts=# SELECT pid, state, left(query,199) FROM pg_stat_activity WHERE pid=20488;
pid | state | left
-------+--------+-------------------------------------------------------------------------------------------------------------------------------------------
 20488 | active | CREATE INDEX CONCURRENTLY index_1113966210 ON eric_enodeb_cell_20180304 USING btree (site_id) WITH (fillfactor='100') TABLESPACE oldindex
ts=# SELECT granted, pid, relation::regclass, locktype, mode virtualxid, virtualxid, virtualtransaction FROM pg_locks WHERE pid=20488 ORDER BY 1;
 granted |  pid  |         relation          |  locktype  |        virtualxid        | virtualxid  | virtualtransaction
---------+-------+---------------------------+------------+--------------------------+-------------+--------------------
 f       | 20488 |                           | virtualxid | ShareLock                | 22/4317099  | 19/28401734
ts=# SELECT granted, pid, relation::regclass, locktype, mode virtualxid, virtualtransaction FROM pg_locks WHERE virtualxid='22/4317099' ORDER BY 1;
 granted |  pid  | relation |  locktype  |  virtualxid   | virtualtransaction
---------+-------+----------+------------+---------------+--------------------
 f       | 20488 |          | virtualxid | ShareLock     | 19/28401734
 t       |  6471 |          | virtualxid | ExclusiveLock | 22/4317099
(2 rows)
So... I gather pg_dump is waiting on the ALTER SET STATISTICS, which is waiting on the CREATE INDEX CONCURRENTLY, which is waiting on the session that ran SELECT pg_export_snapshot(), which I take to mean that I should avoid running pg_repack until the backup is finished..
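A sketch of walking that chain mechanically (9.6+, using pg_blocking_pids(); the starting pid and the depth cap are arbitrary):

-- sketch: follow the blocking chain outward from the ALTER's pid
WITH RECURSIVE chain AS (
    SELECT pid, pg_blocking_pids(pid) AS blocked_by, 1 AS depth
    FROM pg_stat_activity
    WHERE pid = 530
  UNION ALL
    SELECT a.pid, pg_blocking_pids(a.pid), c.depth + 1
    FROM chain c
    JOIN pg_stat_activity a ON a.pid = ANY (c.blocked_by)
    WHERE c.depth < 10  -- guard against unexpectedly long (or circular) chains
)
SELECT depth, pid, blocked_by FROM chain ORDER BY depth, pid;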
ts=# SELECT pid, state, application_name app, left(query,99) FROM pg_stat_activity ORDER BY xact_start LIMIT 9;
  pid  |        state        |    app    |                                                left
-------+---------------------+-----------+-----------------------------------------------------------------------------------------------------
  6471 | idle in transaction | psql      | SELECT pg_export_snapshot();
 20488 | active              | pg_repack | CREATE INDEX CONCURRENTLY index_1113966210 ON eric_enodeb_cell_20180304 USING btree (site_id) WITH
   530 | active              | psql      | ALTER TABLE eric_enodeb_cell_metrics* ALTER start_time SET STATISTICS 400
 17248 | active              | pg_dump   | SELECT c.tableoid, c.oid, c.relname, (SELECT pg_catalog.array_agg(acl ORDER BY row_n) FROM (SELECT
Justin
Justin Pryzby <pryzby@telsasoft.com> writes:
> On Thu, Mar 08, 2018 at 01:57:06PM -0500, Tom Lane wrote:
>> PID 20488 is evidently waiting for PID 6471 to finish its transaction.
>> What's that one doing?
> Um, I thought I had kept track of both pids, but apparently not..
> query | SELECT pg_export_snapshot();
pg_export_snapshot doesn't block for other transactions, though.
Further down, you have output that confirms that:
> 6471 | idle in transaction | psql | SELECT pg_export_snapshot();
That SELECT is the most recently completed command, not the current one.
So now the question is what the connected application is waiting for
before committing.
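Sessions in that state are easy to spot with something along these lines, which also shows how long the transaction has been open:

-- sketch: idle-in-transaction sessions and the age of their open transaction
SELECT pid, state, xact_start, now() - xact_start AS xact_age, left(query, 60) AS last_query
FROM pg_stat_activity
WHERE state LIKE 'idle in transaction%'
ORDER BY xact_start;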
regards, tom lane
On Thu, Mar 08, 2018 at 03:05:36PM -0500, Tom Lane wrote:
> Justin Pryzby <pryzby@telsasoft.com> writes:
> > On Thu, Mar 08, 2018 at 01:57:06PM -0500, Tom Lane wrote:
> >> PID 20488 is evidently waiting for PID 6471 to finish its transaction.
> >> What's that one doing?
>
> > Um, I thought I had kept track of both pids, but apparently not..
> > query | SELECT pg_export_snapshot();
>
> pg_export_snapshot doesn't block for other transactions, though.
> Further down, you have output that confirms that:
>
> > 6471 | idle in transaction | psql | SELECT pg_export_snapshot();
>
> That SELECT is the most recently completed command, not the current one.
> So now the question is what the connected application is waiting for
> before committing.

I believe it does an idle loop around sleep(1), until all the pg_dump --snapshot children are done.

Justin