Thread: PostgreSQL's processes blocking each other are not detected as deadlock

PostgreSQL's processes blocking each other are not detected as deadlock

From: Sorin Mircioiu
Date:
OS: Oracle Linux Server 9.1
Kernel: Linux 5.15.0-100.96.32.el9uek.x86_64

PostgreSQL version: 13.10
Configuration: deadlock_timeout = 1s, log_lock_waits = off
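
With log_lock_waits = off, waits longer than deadlock_timeout are still checked by the deadlock detector but are not logged. For reference, it can be switched on at run time without a restart (a sketch, assuming superuser access):

```
-- Log a message whenever a session waits longer than deadlock_timeout
-- for a lock; takes effect on configuration reload.
ALTER SYSTEM SET log_lock_waits = on;
SELECT pg_reload_conf();
```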

I found some detected deadlocks in the logs:

```
2023-09-22 16:54:19.425 UTC 650dc6ba.5732 497942 22322 ERROR:  deadlock detected
2023-09-22 16:54:19.425 UTC 650dc6ba.5732 497942 22322 DETAIL:  Process 22322 waits for ExclusiveLock on tuple (0,48) of relation 415480 of database 16873; blocked by process 13502.
        Process 13502 waits for ShareLock on transaction 497938; blocked by process 13482.
        Process 13482 waits for ShareLock on transaction 497932; blocked by process 21544.
        Process 21544 waits for ExclusiveLock on tuple (0,48) of relation 415480 of database 16873; blocked by process 22322.
```

```
2023-09-22 16:54:20.292 UTC 650dc538.34aa 497938 13482 ERROR:  deadlock detected
2023-09-22 16:54:20.292 UTC 650dc538.34aa 497938 13482 DETAIL:  Process 13482 waits for ShareLock on transaction 497932; blocked by process 21544.
        Process 21544 waits for ExclusiveLock on tuple (0,48) of relation 415480 of database 16873; blocked by process 13502.
        Process 13502 waits for ShareLock on transaction 497938; blocked by process 13482.
```

Several hours after these deadlocks were detected, I can see that two of the processes from the logs above still appear in the pg_locks view.

Following the PostgreSQL wiki (<https://wiki.postgresql.org/wiki/Lock_Monitoring#Combination_of_blocked_and_blocking_activity>), I can view the blocked and blocking activity using the following queries:

Query (locks)

```
select relation::regclass, * from pg_locks where not granted;
```

Output (locks)

```
-[ RECORD 1 ]------+--------------
relation           |
locktype           | transactionid
database           |
relation           |
page               |
tuple              |
virtualxid         |
transactionid      | 497932
classid            |
objid              |
objsubid           |
virtualtransaction | 16/1248
pid                | 13502
mode               | ShareLock
granted            | f
fastpath           | f

-[ RECORD 2 ]------+--------------
relation           |
locktype           | transactionid
database           |
relation           |
page               |
tuple              |
virtualxid         |
transactionid      | 497941
classid            |
objid              |
objsubid           |
virtualtransaction | 41/34
pid                | 21544
mode               | ShareLock
granted            | f
fastpath           | f
```
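
Both ungranted locks are on transaction IDs. To see which backend currently holds each of those XIDs, pg_locks can be self-joined on the granted side (a sketch I put together, not one of the wiki queries):

```
-- For each ungranted transactionid lock, find the backend that holds the
-- same XID with granted = true, i.e. the transaction being waited on.
SELECT waiter.pid           AS waiting_pid,
       waiter.transactionid AS waited_xid,
       holder.pid           AS holding_pid
FROM pg_locks waiter
JOIN pg_locks holder
  ON  holder.locktype = 'transactionid'
  AND holder.transactionid = waiter.transactionid
  AND holder.granted
WHERE waiter.locktype = 'transactionid'
  AND NOT waiter.granted;
```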

Query (blocking pids)

```
SELECT blocked_locks.pid     AS blocked_pid,
     blocked_activity.usename  AS blocked_user,
     blocking_locks.pid     AS blocking_pid,
     blocking_activity.usename AS blocking_user,
     blocked_activity.query    AS blocked_statement,
     blocking_activity.query   AS current_statement_in_blocking_process
FROM  pg_catalog.pg_locks         blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity  ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks         blocking_locks
    ON blocking_locks.locktype = blocked_locks.locktype
    AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
    AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
    AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
    AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
    AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
    AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
    AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
    AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
    AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
    AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;
```

Output (blocking pids):

```
-[ RECORD 1 ]-------------------------+-----------------------------------
blocked_pid                           | 13502
blocked_user                          | my_username
blocking_pid                          | 21544
blocking_user                         | my_username
blocked_statement                     | SELECT * FROM my_procedure_1()
current_statement_in_blocking_process | SELECT * FROM my_procedure_2()

-[ RECORD 2 ]-------------------------+-----------------------------------
blocked_pid                           | 21544
blocked_user                          | my_username
blocking_pid                          | 13502
blocking_user                         | my_username
blocked_statement                     | SELECT * FROM my_procedure_2()
current_statement_in_blocking_process | SELECT * FROM my_procedure_1()
```

Linux console output:

```
ps -ef | grep postgres | grep waiting

  postgres   13502     972  0 Sep22 ?        00:00:01 postgres: 127.0.0.1(43736) SELECT waiting
  postgres   21544     972  0 Sep22 ?        00:00:00 postgres: 127.0.0.1(41848) SELECT waiting
```

```
strace -p 13502
  strace: Process 13502 attached
  epoll_wait(96, ^Cstrace: Process 13502 detached
   <detached ...>

strace -p 21544
  strace: Process 21544 attached
  epoll_wait(77, ^Cstrace: Process 21544 detached
   <detached ...>
```

I understand that I have to modify the code in order to avoid the deadlocks. I will do that.

I need answers to the following questions:

1. 13502 is blocked by 21544, and 21544 is blocked by 13502. Why didn't PostgreSQL detect this situation as a deadlock? Could the fact that I'm using procedure calls mislead PostgreSQL?
2. The lock mode used is ShareLock (as shown in the locks output above). The official documentation (<https://www.postgresql.org/docs/13/explicit-locking.html>) says that ShareLock is "Acquired by CREATE INDEX (without CONCURRENTLY)." I'm not using CREATE INDEX or anything index-related in my queries, so why is a ShareLock taken in my situation?

Re: PostgreSQL's processes blocking each other are not detected as deadlock

From: Tom Lane
Date:
Sorin Mircioiu <sorin.mircioiu@gmail.com> writes:
> I need answers to the following questions:

> 1. 13502 is blocked by 21544 and 21544 is blocked by 13502. Why PostgreSQL
> didn't detect this situation as a deadlock ?

You have not demonstrated that these two processes are waiting for
each other, only that they are waiting for something.  The subset
of data you presented is insufficient to conclude that a deadlock
should have been reported.  It does appear that some progress
has been made since the previous deadlock, since neither process
is waiting for the same thing it was waiting on as of the deadlock
report.

In the past we've seen cases where a deadlock existed but the
server could not detect it because one step in the waits-for
loop was implemented by client-side behavior rather than a
server-visible lock request.  Not saying that's happened here,
but you might want to consider your application's behavior
with that idea in mind.

> why is ShareLock used for the locking system in my situation ?

ShareLock on an XID is a different thing from ShareLock on a table.
IIRC we request ShareLock on an XID as a way of waiting for the
transaction using that XID to terminate, which is a need that only
comes up in some edge cases like CREATE INDEX CONCURRENTLY.  Have you
identified exactly what statements those functions are executing?

            regards, tom lane



Re: PostgreSQL's processes blocking each other are not detected as deadlock

From: Sorin Mircioiu
Date:
Thank you for the answer.

I have not yet identified exactly which statements these functions are currently executing. In the meantime, pg_blocking_pids confirms that the two processes block each other:

```
select pid,
       usename,
       pg_blocking_pids(pid) as blocked_by,
       query as blocked_query
from pg_stat_activity
where cardinality(pg_blocking_pids(pid)) > 0;

  pid  |   usename   | blocked_by |         blocked_query
-------+-------------+------------+--------------------------------
 13502 | my_username | {21544}    | SELECT * FROM my_procedure_1()
 21544 | my_username | {13502}    | SELECT * FROM my_procedure_2()
(2 rows)
```

The manual says this about pg_blocking_pids: one server process blocks another if it either
    1. holds a lock that conflicts with the blocked process's lock request (a hard block), or
    2. is waiting for a lock that would conflict with the blocked process's lock request and is ahead of it in the wait queue (a soft block).
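
To check which of the two cases applies here, one could test whether each pid returned by pg_blocking_pids actually holds a granted lock on the same lock target as the waiter (a rough sketch; it ignores lock-mode conflicts, so it only approximates the hard/soft distinction):

```
-- For every ungranted lock request, list the pids that pg_blocking_pids
-- reports as blockers and whether each of them holds a granted lock on
-- the same lock target (suggesting a hard block rather than a soft one).
SELECT w.pid     AS waiting_pid,
       b.blocker AS blocking_pid,
       EXISTS (
         SELECT 1
         FROM pg_locks h
         WHERE h.pid = b.blocker
           AND h.granted
           AND h.locktype      = w.locktype
           AND h.database      IS NOT DISTINCT FROM w.database
           AND h.relation      IS NOT DISTINCT FROM w.relation
           AND h.page          IS NOT DISTINCT FROM w.page
           AND h.tuple         IS NOT DISTINCT FROM w.tuple
           AND h.virtualxid    IS NOT DISTINCT FROM w.virtualxid
           AND h.transactionid IS NOT DISTINCT FROM w.transactionid
           AND h.classid       IS NOT DISTINCT FROM w.classid
           AND h.objid         IS NOT DISTINCT FROM w.objid
           AND h.objsubid      IS NOT DISTINCT FROM w.objsubid
       ) AS holds_lock_on_same_target
FROM pg_locks w
CROSS JOIN LATERAL unnest(pg_blocking_pids(w.pid)) AS b(blocker)
WHERE NOT w.granted;
```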

Based on what you said, I presume that in my situation pg_blocking_pids is simply reporting that each process is waiting for a lock.

```
SELECT * FROM pg_locks
WHERE pid NOT IN (13502, 21544, (SELECT pg_backend_pid()));
```

returns zero rows, so my understanding is that no other process holds any locks.

```
SELECT pid FROM pg_stat_activity
WHERE state != 'idle'
  AND pid NOT IN (13502, 21544, (SELECT pg_backend_pid()));
```

returns zero rows, so there are no other active processes.
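
For completeness, the locks still held by the two waiting backends themselves can be listed directly (using the pids from above):

```
SELECT pid, locktype, relation::regclass, transactionid, mode, granted
FROM pg_locks
WHERE pid IN (13502, 21544)
ORDER BY pid, granted DESC, locktype;
```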

Still, if no other process is running, what could be making these processes wait for a lock?

Below is a backtrace taken with gcore; both processes show the same stack.

```
#0  0x00007f3569b9176a in epoll_wait () from /lib64/libc.so.6
#1  0x000055f4c3124471 in WaitEventSetWait ()
#2  0x000055f4c3124872 in WaitLatchOrSocket ()
#3  0x000055f4c313cc8b in ProcSleep ()
#4  0x000055f4c3132c3b in WaitOnLock ()
#5  0x000055f4c3132fa9 in LockAcquireExtended ()
#6  0x000055f4c3133ef6 in XactLockTableWait ()
#7  0x000055f4c2e49dcf in heap_update ()
#8  0x000055f4c2e51451 in heapam_tuple_update.lto_priv ()
#9  0x000055f4c2fee3cf in ExecUpdate ()
#10 0x000055f4c2fefce4 in ExecModifyTable ()
#11 0x000055f4c2fbff42 in standard_ExecutorRun ()
#12 0x000055f4c3147ec5 in ProcessQuery ()
#13 0x000055f4c3149c26 in PortalRunMulti ()
#14 0x000055f4c3149dba in FillPortalStore ()
#15 0x000055f4c314a424 in PortalRunFetch ()
#16 0x000055f4c2ffe536 in _SPI_cursor_operation.lto_priv.0 ()
#17 0x00007f35415d8145 in exec_for_query () from /usr/lib64/pgsql/plpgsql.so
#18 0x00007f35415d8be6 in exec_stmt () from /usr/lib64/pgsql/plpgsql.so
#19 0x00007f35415db55e in exec_stmts () from /usr/lib64/pgsql/plpgsql.so
#20 0x00007f35415da5fc in exec_stmt () from /usr/lib64/pgsql/plpgsql.so
#21 0x00007f35415db55e in exec_stmts () from /usr/lib64/pgsql/plpgsql.so
#22 0x00007f35415da5fc in exec_stmt () from /usr/lib64/pgsql/plpgsql.so
#23 0x00007f35415db55e in exec_stmts () from /usr/lib64/pgsql/plpgsql.so
#24 0x00007f35415da5fc in exec_stmt () from /usr/lib64/pgsql/plpgsql.so
#25 0x00007f35415db55e in exec_stmts () from /usr/lib64/pgsql/plpgsql.so
#26 0x00007f35415db8cc in exec_stmt_block () from /usr/lib64/pgsql/plpgsql.so
#27 0x00007f35415d9b6b in exec_stmt () from /usr/lib64/pgsql/plpgsql.so
#28 0x00007f35415dab81 in plpgsql_exec_function () from /usr/lib64/pgsql/plpgsql.so
#29 0x00007f35415e323b in plpgsql_call_handler () from /usr/lib64/pgsql/plpgsql.so
#30 0x000055f4c2fb6863 in ExecInterpExpr ()
#31 0x000055f4c2fef60a in ExecResult ()
#32 0x000055f4c2fbff42 in standard_ExecutorRun ()
#33 0x000055f4c2ffdd2d in _SPI_execute_plan ()
#34 0x000055f4c2ffe126 in SPI_execute_plan_with_paramlist ()
#35 0x00007f35415d7c7c in exec_stmt_execsql () from /usr/lib64/pgsql/plpgsql.so
#36 0x00007f35415d865b in exec_stmt () from /usr/lib64/pgsql/plpgsql.so
#37 0x00007f35415db55e in exec_stmts () from /usr/lib64/pgsql/plpgsql.so
#38 0x00007f35415db8cc in exec_stmt_block () from /usr/lib64/pgsql/plpgsql.so
#39 0x00007f35415d9b6b in exec_stmt () from /usr/lib64/pgsql/plpgsql.so
#40 0x00007f35415dab81 in plpgsql_exec_function () from /usr/lib64/pgsql/plpgsql.so
#41 0x00007f35415e323b in plpgsql_call_handler () from /usr/lib64/pgsql/plpgsql.so
#42 0x000055f4c2fcc7da in ExecMakeTableFunctionResult ()
#43 0x000055f4c2fdd9fa in FunctionNext ()
#44 0x000055f4c2fbff42 in standard_ExecutorRun ()
#45 0x000055f4c3148375 in PortalRunSelect ()
#46 0x000055f4c314a0a6 in PortalRun ()
#47 0x000055f4c31453d3 in exec_simple_query.lto_priv ()
#48 0x000055f4c314c9ce in PostgresMain ()
#49 0x000055f4c30cdcdf in ServerLoop ()
#50 0x000055f4c30ceb57 in PostmasterMain ()
#51 0x000055f4c2e04171 in main ()
```
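
If I read this stack correctly, the UPDATE inside the PL/pgSQL function has reached XactLockTableWait, i.e. it is waiting for some other transaction's XID to end before it can update the row. A minimal two-session illustration of that kind of wait (using a hypothetical table t, not my real schema):

```
-- Session 1
BEGIN;
UPDATE t SET val = val + 1 WHERE id = 1;
-- row is now locked by session 1's (still open) transaction

-- Session 2
BEGIN;
UPDATE t SET val = val + 1 WHERE id = 1;
-- blocks here: pg_locks shows an ungranted ShareLock on session 1's
-- transactionid, and the stack goes through heap_update and
-- XactLockTableWait, much like the backtrace above
```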

   


