BUG #18586: Process (and transaction) is stuck in IPC when the DB is under high load
From:    PG Bug reporting form
Subject: BUG #18586: Process (and transaction) is stuck in IPC when the DB is under high load
Date:
Msg-id:  18586-03e1535b1b34db81@postgresql.org
List:    pgsql-bugs
The following bug has been logged on the website:

Bug reference:      18586
Logged by:          Dimar van Rietschoten
Email address:      d_van_rietschoten@hotmail.com
PostgreSQL version: 15.7
Operating system:   Unknown (GCP managed)
Description:

We have a small database server (2 cores, 4 GB). We performed an initial
load, putting quite a bit of load on this database. The initial load is
performed by external Kubernetes pods. After a while, we saw the number of
connections to the database increase until we ran out of connections. We
found that one of the processes had been stuck waiting on IPC for many
hours. That transaction held a lock, causing other transactions to wait on
that lock. It seems that after a while the application doing the initial
load abandoned the transaction / connection and created a new one. This new
connection then hit the same lock, with the same outcome. We could
reproduce the problem with a somewhat larger server (4 CPUs, 8 GB). With 6
CPUs and 12 GB we did not run into it.

Query used to find the stuck transactions:

    select * from pg_stat_activity order by backend_start asc;

Result (one row per line, some values hidden):

    <null> <null> 21 <null> <null> <null> <Other> <null> <null> 2024-08-16 13:17:02.331128 <null> <null> <null> Timeout CheckpointWriteDelay <null> <Other> <Other> <null> checkpointer
    <null> <null> 22 <null> <null> <null> <Other> <null> <null> 2024-08-16 13:17:02.332034 <null> <null> <null> Activity BgWriterMain <null> <Other> <Other> <null> background writer
    <null> <null> 24 <null> <null> <null> <Other> <null> <null> 2024-08-16 13:17:02.372296 <null> <null> <null> Activity WalWriterMain <null> <Other> <Other> <null> walwriter
    <null> <null> 25 <null> <null> <null> <Other> <null> <null> 2024-08-16 13:17:02.373328 <null> <null> <null> Activity AutoVacuumMain <null> <Other> <Other> <null> autovacuum launcher
    <null> <null> 26 <null> <null> <null> <Other> <null> <null> 2024-08-16 13:17:02.374639 <null> <null> <null> Activity ArchiverMain <null> <Other> <Other> <null> archiver
    <null> <null> 27 <null> 10 cloudsqladmin <Other> <null> <null> 2024-08-16 13:17:02.374752 <null> <null> <null> Activity LogicalLauncherMain <null> <Other> <Other> <null> logical replication launcher
    16384 cloudsqladmin 30 <null> 10 cloudsqladmin cloudsqlagent <Other> <null> 56364 2024-08-16 13:17:02.523891 <null> 2024-08-16 15:28:28.290619 2024-08-16 15:28:28.29065 Client ClientRead idle <Other> <Other> <null> ; client backend
    16488 iss 140 <null> 16490 <I hid the username> PostgreSQL JDBC Driver <Other> <null> -1 2024-08-16 13:17:06.768402 2024-08-16 13:27:41.172003 2024-08-16 13:27:41.439554 2024-08-16 13:27:41.439557 IPC ParallelFinish active <Other> <Other> <null> select l1_0.inbound_shipment_id,l1_0.inbound_shipment_line_id,l1_0.ean,lc1_0.id,lc1_0.announcement_type,lc1_0.load_carrier_id,lc1_0.state,l1_0.quantity,l1_0.sku from inbound_shipment_lines l1_0 left join load_carriers lc1_0 on lc1_0.id=l1_0.load_carrier_id where l1_0.inbound_shipment_id=$1 client backend
    ... (there were other queries)

Using pg_locks, I found that this process (140) was blocking other queries.
Those blocked queries make sense, though (that is the application logic).

My question: why did PID 140 never complete (we waited ~12 hours)? Even the
worst case involves few rows (fewer than 20,000).
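For reference, the same blocking relationships can be read straight from
pg_stat_activity with pg_blocking_pids() (available since 9.6), without
decoding pg_locks by hand. This is only a sketch of such a query, not the
exact one used here:

    -- Sketch: list backends that are blocked and which PIDs block them.
    SELECT pid,
           state,
           wait_event_type,
           wait_event,
           pg_blocking_pids(pid) AS blocked_by,
           left(query, 60)       AS query
    FROM   pg_stat_activity
    WHERE  cardinality(pg_blocking_pids(pid)) > 0
    ORDER  BY backend_start;

In the situation above, the blocked backends would be expected to show 140
in blocked_by.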
I asked Postgres for a query plan for a similar query; it comes up with:

    Gather  (cost=1015.66..4488.84 rows=1376 width=104) (actual time=6.434..14.233 rows=4 loops=1)
      Workers Planned: 1
      Workers Launched: 1
      ->  Nested Loop Left Join  (cost=15.66..3351.24 rows=809 width=104) (actual time=2.583..2.587 rows=2 loops=2)
            ->  Parallel Bitmap Heap Scan on inbound_shipment_lines l1_0  (cost=15.23..1400.35 rows=809 width=62) (actual time=2.574..2.576 rows=2 loops=2)
                  Recheck Cond: (inbound_shipment_id = 6134742)
                  Heap Blocks: exact=1
                  ->  Bitmap Index Scan on inbound_shipment_lines_inbound_shipment_id_idx  (cost=0.00..14.88 rows=1376 width=0) (actual time=3.551..3.551 rows=4 loops=1)
                        Index Cond: (inbound_shipment_id = 6134742)
            ->  Index Scan using load_carriers_pkey on load_carriers lc1_0  (cost=0.43..2.41 rows=1 width=46) (actual time=0.002..0.002 rows=0 loops=4)
                  Index Cond: (id = l1_0.load_carrier_id)

This takes very little time to complete, even for the worst-case
occurrence. Although we put (too?) much load on the database, I would not
expect the process to hang. I'd also expect the transaction to be rolled
back when the connection is closed. I tried stopping all clients, but the
connections remained on the database server.

With the 2-core, 4 GB system the database CPU usage was constant at 100%.
With 4 cores, 8 GB it was constant around 95%. With 6 cores, 12 GB it was a
little over 60%.

I saw some similar posts, for instance
https://postgrespro.com/list/id/0f64b4c7fc200890f2055ce4d6650e9c2191fac2.camel@j-davis.com
or
https://www.postgresql.org/message-id/E9FA92C2921F31408041863B74EE4C2001A479E590%40CCPMAILDAG03.cantab.local

I understand the issue is a bit vague, but it's all I've got:
- The process waiting on IPC seems to get stuck under heavy load.
- The connections / transactions were not closed after the client went away.
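In case it helps anyone hitting the same thing: since the stuck backend is
waiting on ParallelFinish (i.e. on the parallel workers under the Gather
node), one thing that could be tried, purely a guess at a workaround and
not a confirmed fix, is to re-run the query with parallelism disabled and
to terminate the orphaned backend by hand:

    -- Run the same query without a Gather node (session-level setting):
    SET max_parallel_workers_per_gather = 0;

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT l1_0.inbound_shipment_id, l1_0.inbound_shipment_line_id, l1_0.ean,
           lc1_0.id, lc1_0.announcement_type, lc1_0.load_carrier_id,
           lc1_0.state, l1_0.quantity, l1_0.sku
    FROM   inbound_shipment_lines l1_0
    LEFT   JOIN load_carriers lc1_0 ON lc1_0.id = l1_0.load_carrier_id
    WHERE  l1_0.inbound_shipment_id = 6134742;

    -- Clean up the abandoned backend that still holds the lock
    -- (PID 140 in the output above):
    SELECT pg_terminate_backend(140);

This would not explain why ParallelFinish never returns, but terminating
the backend should roll back its transaction, release the lock, and let the
queued transactions proceed.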