BUG #18586: Process (and transaction) is stuck in IPC when the DB is under high load
From:    PG Bug reporting form
Subject: BUG #18586: Process (and transaction) is stuck in IPC when the DB is under high load
Date:
Msg-id:  18586-03e1535b1b34db81@postgresql.org
List:    pgsql-bugs
The following bug has been logged on the website:

Bug reference:      18586
Logged by:          Dimar van Rietschoten
Email address:      d_van_rietschoten@hotmail.com
PostgreSQL version: 15.7
Operating system:   Unknown (GCP managed)
Description:

We have a small database server (2 cores, 4 GB). We performed an initial
load, putting quite a bit of load on this database. The initial load is
performed by external Kubernetes pods. After a while, we saw the number of
connections to the database increase until we ran out of connections. We
found that one of the processes had been stuck waiting on IPC for many
hours. That transaction held a lock, causing other transactions to wait on
that lock. It seems that after a while the application doing the initial
load abandoned the transaction / connection and created a new one. This new
connection then hit the same lock, with the same outcome. We could
reproduce the problem with a somewhat larger server (4 CPUs, 8 GB). With 6
CPUs and 12 GB we did not run into it.

Query used to find the stuck transactions:

    select * from pg_stat_activity order by backend_start asc;

Result (one row per line, some values hidden):

    <null> <null> 21 <null> <null> <null> <Other> <null> <null> 2024-08-16 13:17:02.331128 <null> <null> <null> Timeout CheckpointWriteDelay <null> <Other> <Other> <null> checkpointer
    <null> <null> 22 <null> <null> <null> <Other> <null> <null> 2024-08-16 13:17:02.332034 <null> <null> <null> Activity BgWriterMain <null> <Other> <Other> <null> background writer
    <null> <null> 24 <null> <null> <null> <Other> <null> <null> 2024-08-16 13:17:02.372296 <null> <null> <null> Activity WalWriterMain <null> <Other> <Other> <null> walwriter
    <null> <null> 25 <null> <null> <null> <Other> <null> <null> 2024-08-16 13:17:02.373328 <null> <null> <null> Activity AutoVacuumMain <null> <Other> <Other> <null> autovacuum launcher
    <null> <null> 26 <null> <null> <null> <Other> <null> <null> 2024-08-16 13:17:02.374639 <null> <null> <null> Activity ArchiverMain <null> <Other> <Other> <null> archiver
    <null> <null> 27 <null> 10 cloudsqladmin <Other> <null> <null> 2024-08-16 13:17:02.374752 <null> <null> <null> Activity LogicalLauncherMain <null> <Other> <Other> <null> logical replication launcher
    16384 cloudsqladmin 30 <null> 10 cloudsqladmin cloudsqlagent <Other> <null> 56364 2024-08-16 13:17:02.523891 <null> 2024-08-16 15:28:28.290619 2024-08-16 15:28:28.29065 Client ClientRead idle <Other> <Other> <null> ; client backend
    16488 iss 140 <null> 16490 <I hid the username> PostgreSQL JDBC Driver <Other> <null> -1 2024-08-16 13:17:06.768402 2024-08-16 13:27:41.172003 2024-08-16 13:27:41.439554 2024-08-16 13:27:41.439557 IPC ParallelFinish active <Other> <Other> <null> select l1_0.inbound_shipment_id,l1_0.inbound_shipment_line_id,l1_0.ean,lc1_0.id,lc1_0.announcement_type,lc1_0.load_carrier_id,lc1_0.state,l1_0.quantity,l1_0.sku from inbound_shipment_lines l1_0 left join load_carriers lc1_0 on lc1_0.id=l1_0.load_carrier_id where l1_0.inbound_shipment_id=$1 client backend
    ... (there were other queries)

Using pg_locks, I found that this process (140) was blocking other queries.
Those blocked queries make sense, though (that is the application logic).

My question: why did PID 140 never complete (we waited ~12 hours)? Even the
worst case involves few rows (fewer than 20,000).
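For reference, the same blocking relationships can be read straight from
pg_stat_activity with pg_blocking_pids() (available since 9.6), without
decoding pg_locks by hand. This is only a sketch of such a query, not the
exact one used here:

    -- Sketch: list backends that are blocked and which PIDs block them.
    SELECT pid,
           state,
           wait_event_type,
           wait_event,
           pg_blocking_pids(pid) AS blocked_by,
           left(query, 60)       AS query
    FROM   pg_stat_activity
    WHERE  cardinality(pg_blocking_pids(pid)) > 0
    ORDER  BY backend_start;

In the situation above, the blocked backends would be expected to show 140
in blocked_by.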
I asked Postgres for a query plan for a similar query; it comes up with:

    Gather  (cost=1015.66..4488.84 rows=1376 width=104) (actual time=6.434..14.233 rows=4 loops=1)
      Workers Planned: 1
      Workers Launched: 1
      ->  Nested Loop Left Join  (cost=15.66..3351.24 rows=809 width=104) (actual time=2.583..2.587 rows=2 loops=2)
            ->  Parallel Bitmap Heap Scan on inbound_shipment_lines l1_0  (cost=15.23..1400.35 rows=809 width=62) (actual time=2.574..2.576 rows=2 loops=2)
                  Recheck Cond: (inbound_shipment_id = 6134742)
                  Heap Blocks: exact=1
                  ->  Bitmap Index Scan on inbound_shipment_lines_inbound_shipment_id_idx  (cost=0.00..14.88 rows=1376 width=0) (actual time=3.551..3.551 rows=4 loops=1)
                        Index Cond: (inbound_shipment_id = 6134742)
            ->  Index Scan using load_carriers_pkey on load_carriers lc1_0  (cost=0.43..2.41 rows=1 width=46) (actual time=0.002..0.002 rows=0 loops=4)
                  Index Cond: (id = l1_0.load_carrier_id)

This takes very little time to complete, even for the worst-case
occurrence. Although we put (too?) much load on the database, I would not
expect the process to hang. I'd also expect the transaction to be rolled
back when the connection is closed. I tried stopping all clients, but the
connections remained on the database server.

With the 2-core, 4 GB system the database CPU usage was constant at 100%.
With 4 cores, 8 GB it was constant around 95%. With 6 cores, 12 GB it was a
little over 60%.

I saw some similar posts, for instance
https://postgrespro.com/list/id/0f64b4c7fc200890f2055ce4d6650e9c2191fac2.camel@j-davis.com
or
https://www.postgresql.org/message-id/E9FA92C2921F31408041863B74EE4C2001A479E590%40CCPMAILDAG03.cantab.local

I understand the issue is a bit vague, but it's all I've got:
- The process waiting on IPC seems to get stuck under heavy load.
- The connections / transactions were not closed after the client went away.
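In case it helps anyone hitting the same thing: since the stuck backend is
waiting on ParallelFinish (i.e. on the parallel workers under the Gather
node), one thing that could be tried, purely a guess at a workaround and
not a confirmed fix, is to re-run the query with parallelism disabled and
to terminate the orphaned backend by hand:

    -- Run the same query without a Gather node (session-level setting):
    SET max_parallel_workers_per_gather = 0;

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT l1_0.inbound_shipment_id, l1_0.inbound_shipment_line_id, l1_0.ean,
           lc1_0.id, lc1_0.announcement_type, lc1_0.load_carrier_id,
           lc1_0.state, l1_0.quantity, l1_0.sku
    FROM   inbound_shipment_lines l1_0
    LEFT   JOIN load_carriers lc1_0 ON lc1_0.id = l1_0.load_carrier_id
    WHERE  l1_0.inbound_shipment_id = 6134742;

    -- Clean up the abandoned backend that still holds the lock
    -- (PID 140 in the output above):
    SELECT pg_terminate_backend(140);

This would not explain why ParallelFinish never returns, but terminating
the backend should roll back its transaction, release the lock, and let the
queued transactions proceed.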