Discussion: Postgresql 16.3 Out Of Memory


Postgresql 16.3 Out Of Memory

From: Radu Radutiu
Date:
Hello,

I have an out of memory problem after upgrading from PostgreSQL 12 to 16.3. I have identified one query that reproduces the error on demand. Once the query starts, it eventually exhausts all RAM and swap until the OOM killer stops PostgreSQL.
The setup is as follows:
- One VMware VM, RHEL 8.9, 64 GB RAM, 16 vCPU, 32 GB swap. The database is used by a local Java application that takes up to 22 GB (hard limit). The application uses 2 long-running database connections from a connection pool. The database has about 10 tables, with 3 large tables involved in the problem query:
    relation     | total_size
-----------------+------------
 outputrequest   | 217 GB
 inputrequest    | 149 GB
 tran            | 5041 MB

I have the following settings:
shared_buffers = '8192MB'
effective_cache_size = '24GB'
maintenance_work_mem = '2GB'
checkpoint_completion_target = '0.9'
wal_buffers = '16MB'
default_statistics_target = '100'
random_page_cost = '1.1'
effective_io_concurrency = '200'
work_mem = '104857kB'
min_wal_size = '1GB'
max_wal_size = '4GB'
max_worker_processes = '16'
max_parallel_workers_per_gather = '4'
max_parallel_workers = '16'
max_parallel_maintenance_workers = '4'
jit = 'off'
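
To double-check what the server is actually running with, the non-default settings can be listed from psql (a quick sketch, not part of the original report):

SELECT name, setting, unit, source
FROM pg_settings
WHERE source NOT IN ('default', 'override');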

Besides the OOM there is another (possibly related) problem: PostgreSQL's memory usage is significantly higher than shared_buffers plus the memory used per connection. On my system, with shared_buffers = 8 GB and work_mem ~ 100 MB, memory usage sits at 20 GB while the system is running only simple inserts (the usage might have grown due to earlier queries).

I have found that if I set enable_parallel_hash = 'off', the problem query still uses a lot of memory (above 30 GB) but does not trigger the OOM killer; with enable_parallel_hash = 'on' I always get OOM. For simplicity I've restricted the PostgreSQL systemd service to cap memory at 36 GB (so that I don't have to wait for the 32 GB of swap to fill before the OOM). Below is the memory usage while the query runs, just before systemd restarts postgres: 35.9 GB, just under the 36 GB limit (the query is running under EXPLAIN (ANALYZE, BUFFERS); the other 2 idle connections are from the application):
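
The 36 GB cap itself can be set with a systemd drop-in along these lines (a minimal sketch; the actual override.conf contents are not shown in the thread, and on a pure cgroup-v1 setup the directive may need to be MemoryLimit= instead):

# /etc/systemd/system/postgresql-16.service.d/override.conf
[Service]
MemoryMax=36G

Apply it with systemctl daemon-reload followed by a service restart.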

[postgres@ips3 ~]$ systemctl status postgresql-16; top -u postgres -n 1 -c
● postgresql-16.service - PostgreSQL 16 database server
   Loaded: loaded (/etc/systemd/system/postgresql-16.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/postgresql-16.service.d
           └─override.conf
   Active: active (running) since Mon 2024-06-03 04:23:16 +08; 11h ago
     Docs: https://www.postgresql.org/docs/16/static/
  Process: 283703 ExecStartPre=/usr/pgsql-16/bin/postgresql-16-check-db-dir ${PGDATA} (code=exited, status=0/SUCCESS)
 Main PID: 283709 (postgres)
    Tasks: 14 (limit: 408404)
   Memory: 35.9G (limit: 36.0G)
   CGroup: /system.slice/postgresql-16.service
           ├─283709 /usr/pgsql-16/bin/postgres -D /var/lib/pgsql/16/data/
           ├─283710 postgres: logger
           ├─295499 postgres: checkpointer
           ├─295500 postgres: background writer
           ├─295506 postgres: walwriter
           ├─295507 postgres: autovacuum launcher
           ├─295508 postgres: logical replication launcher
           ├─295511 postgres: db1 db1 127.0.0.1(37720) idle
           ├─295946 postgres: postgres db1 [local] EXPLAIN
           ├─295947 postgres: parallel worker for PID 295946
           ├─295948 postgres: parallel worker for PID 295946
           ├─295949 postgres: parallel worker for PID 295946
           ├─295950 postgres: parallel worker for PID 295946
           └─296037 postgres: db1 db1 127.0.0.1(41708) idle

top - 16:14:32 up 14 days,  4:42,  1 user,  load average: 5.41, 3.58, 2.25
Tasks: 405 total,   6 running, 399 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.1 us,  5.2 sy,  0.0 ni, 71.2 id, 21.3 wa,  0.7 hi,  0.4 si,  0.0 st
MiB Mem :  63873.4 total,   2162.0 free,  57177.8 used,   4533.6 buff/cache
MiB Swap:  32260.0 total,   3813.8 free,  28446.2 used.   4349.0 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                                                                                          
 295947 postgres  20   0   21.2g   7.4g 654168 R  20.0  11.8   1:31.81 postgres: parallel worker for PID 295946                                                                                                                                          
 295949 postgres  20   0   21.3g   7.5g 656316 R  20.0  12.0   1:32.57 postgres: parallel worker for PID 295946                                                                                                                                          
 295950 postgres  20   0   21.2g   7.5g 654756 R  20.0  12.1   1:31.99 postgres: parallel worker for PID 295946                                                                                                                                          
 295946 postgres  20   0   21.2g   7.4g 655684 R  13.3  11.8   1:31.86 postgres: postgres db1 [local] EXPLAIN                                                                                                                                        
 295948 postgres  20   0   21.6g   6.0g 656024 R  13.3   9.6   1:37.91 postgres: parallel worker for PID 295946                                                                                                                                          
 283709 postgres  20   0 8960628  44144  43684 S   0.0   0.1   0:42.28 /usr/pgsql-16/bin/postgres -D /var/lib/pgsql/16/data/                                                                                                                            
 283710 postgres  20   0  355288   4876   4576 S   0.0   0.0   0:04.42 postgres: logger                                                                              
 295499 postgres  20   0 8960936 107376 106868 S   0.0   0.2   0:01.75 postgres: checkpointer                                                                                                                                                            
 295500 postgres  20   0 8960764   7764   7364 S   0.0   0.0   0:00.08 postgres: background writer                                                                                                                                                      
 295506 postgres  20   0 8960764  22008  21624 S   0.0   0.0   0:00.21 postgres: walwriter                                                                                                                                                              
 295507 postgres  20   0 8962216   6960   6496 S   0.0   0.0   0:00.00 postgres: autovacuum launcher                                                                                                                                                    
 295508 postgres  20   0 8962232   7436   7060 S   0.0   0.0   0:00.00 postgres: logical replication launcher                                                                                                                                            
 295511 postgres  20   0 8974684 133608 131104 S   0.0   0.2   0:23.90 postgres: db1 db1 127.0.0.1(37720) idle                                                                                                                                    
 295945 postgres  20   0  274168   6236   5524 S   0.0   0.0   0:00.00 psql -f test.sql db1                                                                                                                                                          
 296037 postgres  20   0 8962736  13972  13684 S   0.0   0.0   0:00.00 postgres: db1 db1 127.0.0.1(41708) idle    
 
 
Below is the execution plan that gives OOM:

                                                                                                                                QUERY PLAN                                                                                                                                  
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Gather Merge  (cost=41247490.48..45605473.14 rows=36397008 width=66)
   Workers Planned: 4
   ->  Sort  (cost=41246490.42..41269238.55 rows=9099252 width=66)
         Sort Key: t.msg_status DESC, t.input_sequence
         ->  Parallel Hash Right Join  (cost=33197712.34..39496299.38 rows=9099252 width=66)
               Hash Cond: (snd_tro.reply_input_sequence = t.input_sequence)
               ->  Parallel Seq Scan on outputrequest snd_tro  (cost=0.00..5056151.17 rows=19783917 width=16)
               ->  Parallel Hash  (cost=32856685.24..32856685.24 rows=6148088 width=58)
                     ->  Parallel Hash Right Join  (cost=26706855.23..32856685.24 rows=6148088 width=58)
                           Hash Cond: (rec_tro.input_sequence = t.input_sequence)
                           ->  Parallel Seq Scan on outputrequest rec_tro  (cost=0.00..5056151.17 rows=19783917 width=16)
                           ->  Parallel Hash  (cost=26492657.28..26492657.28 rows=4154076 width=50)
                                 ->  Parallel Hash Right Join  (cost=20364128.48..26492657.28 rows=4154076 width=50)
                                       Hash Cond: (rpl_snd_tro.reply_input_sequence = r.input_sequence)
                                       ->  Parallel Seq Scan on outputrequest rpl_snd_tro  (cost=0.00..5056151.17 rows=19783917 width=16)
                                       ->  Parallel Hash  (cost=20166158.53..20166158.53 rows=4154076 width=42)
                                             ->  Parallel Hash Right Join  (cost=14048784.52..20166158.53 rows=4154076 width=42)
                                                   Hash Cond: (rpl_rec_tro.input_sequence = r.input_sequence)
                                                   ->  Parallel Seq Scan on outputrequest rpl_rec_tro  (cost=0.00..5056151.17 rows=19783917 width=16)
                                                   ->  Parallel Hash  (cost=13867042.57..13867042.57 rows=4154076 width=34)
                                                         ->  Parallel Hash Right Join  (cost=6730983.28..13867042.57 rows=4154076 width=34)
                                                               Hash Cond: (r.originalrequest_id = t.input_sequence)
                                                               ->  Parallel Seq Scan on inputrequest r  (cost=0.00..6465209.90 rows=13367390 width=16)
                                                               ->  Parallel Hash  (cost=6565465.33..6565465.33 rows=4154076 width=26)
                                                                     ->  Parallel Seq Scan on inputrequest t  (cost=0.00..6565465.33 rows=4154076 width=26)
                                                                           Filter: ((receive_time < '2024-05-17 00:00:00'::timestamp without time zone) AND (input_sequence < '202406020168279904'::bigint) AND ((msg_status)::text = ANY ('{COMPLETED,REJECTED}'::text[])))
(26 rows)

 
This is the execution plan with enable_parallel_hash = 'off', which completes but still uses more than 30 GB of RAM:

 -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Gather Merge  (cost=51681038.73..56036969.38 rows=36379870 width=66) (actual time=68534.468..72489.115 rows=14895689 loops=1)
   Workers Planned: 4
   Workers Launched: 4
   Buffers: shared hit=212733952 read=3252691 dirtied=28327 written=35494, temp read=1350681 written=1350691
   ->  Sort  (cost=51680038.67..51702776.09 rows=9094968 width=66) (actual time=68352.387..68953.103 rows=2979138 loops=5)
         Sort Key: t.msg_status DESC, t.input_sequence
         Sort Method: external merge  Disk: 179448kB
         Buffers: shared hit=212733952 read=3252691 dirtied=28327 written=35494, temp read=1350681 written=1350691
         Worker 0:  Sort Method: external merge  Disk: 177528kB
         Worker 1:  Sort Method: external merge  Disk: 178216kB
         Worker 2:  Sort Method: external merge  Disk: 177272kB
         Worker 3:  Sort Method: external merge  Disk: 176248kB
         ->  Hash Left Join  (cost=8179839.27..49930701.19 rows=9094968 width=66) (actual time=25952.805..64809.594 rows=2979138 loops=5)
               Hash Cond: (t.input_sequence = rec_tro.input_sequence)
               Buffers: shared hit=212733888 read=3252691 dirtied=28327 written=35494, temp read=1239592 written=1239592
               ->  Nested Loop Left Join  (cost=2.27..38995500.02 rows=6144595 width=58) (actual time=1.541..27287.533 rows=2979138 loops=5)
                     Buffers: shared hit=207879681 read=2037114 dirtied=1546 written=35418
                     ->  Nested Loop Left Join  (cost=1.70..30880205.10 rows=4151312 width=50) (actual time=1.236..21739.231 rows=2979138 loops=5)
                           Buffers: shared hit=140861726 read=2004019 dirtied=1546 written=34842
                           ->  Nested Loop Left Join  (cost=1.13..22788518.11 rows=4151312 width=42) (actual time=1.082..18110.990 rows=2979138 loops=5)
                                 Buffers: shared hit=103757087 read=1971773 dirtied=1546 written=33891
                                 ->  Nested Loop Left Join  (cost=0.56..14665384.33 rows=4151312 width=34) (actual time=0.718..14005.092 rows=2979138 loops=5)
                                       Buffers: shared hit=66987152 read=1604817 dirtied=1546 written=20820
                                       ->  Parallel Seq Scan on inputrequest t  (cost=0.00..6561097.71 rows=4151312 width=26) (actual time=0.291..4167.263 rows=2979138 loops=5)
                                             Filter: ((receive_time < '2024-05-17 00:00:00'::timestamp without time zone) AND (input_sequence < '202406020168279904'::bigint) AND ((msg_status)::text = ANY ('{COMPLETED,REJECTED}'::text[])))
                                             Rows Removed by Filter: 7555153
                                             Buffers: shared hit=693060 read=888771 dirtied=1546 written=395
                                       ->  Index Scan using inputrequest_originalrequest_id_idx on inputrequest r  (cost=0.56..1.94 rows=1 width=16) (actual time=0.003..0.003 rows=0 loops=14895689)
                                             Index Cond: (originalrequest_id = t.input_sequence)
                                             Buffers: shared hit=66294092 read=716046 written=20425
                                 ->  Index Scan using outputrequest_input_sequence_idx on outputrequest rpl_rec_tro  (cost=0.57..1.95 rows=1 width=16) (actual time=0.001..0.001 rows=0 loops=14895689)
                                       Index Cond: (input_sequence = r.input_sequence)
                                       Buffers: shared hit=36769935 read=366956 written=13071
                           ->  Index Scan using outputrequest_reply_input_sequence_idx on outputrequest rpl_snd_tro  (cost=0.57..1.94 rows=1 width=16) (actual time=0.001..0.001 rows=0 loops=14895689)
                                 Index Cond: (reply_input_sequence = r.input_sequence)
                                 Buffers: shared hit=37104639 read=32246 written=951
                     ->  Index Scan using outputrequest_reply_input_sequence_idx on outputrequest snd_tro  (cost=0.57..1.94 rows=1 width=16) (actual time=0.002..0.002 rows=1 loops=14895689)
                           Index Cond: (reply_input_sequence = t.input_sequence)
                           Buffers: shared hit=67017955 read=33095 written=576
               ->  Hash  (cost=5646459.11..5646459.11 rows=79090711 width=16) (actual time=25930.971..25930.972 rows=52601291 loops=5)
                     Buckets: 4194304  Batches: 32  Memory Usage: 109766kB
                     Buffers: shared hit=4853863 read=1215577 dirtied=26781 written=76, temp written=1119615
                     ->  Seq Scan on outputrequest rec_tro  (cost=0.00..5646459.11 rows=79090711 width=16) (actual time=0.114..17424.298 rows=79099765 loops=5)
                           Buffers: shared hit=4853863 read=1215577 dirtied=26781 written=76
 Planning:
   Buffers: shared hit=328 read=12
 Planning Time: 3.861 ms
 Execution Time: 73323.032 ms
(48 rows)

After successfully running the query, the PostgreSQL memory usage is:
● postgresql-16.service - PostgreSQL 16 database server
   Loaded: loaded (/etc/systemd/system/postgresql-16.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/postgresql-16.service.d
           └─override.conf
   Active: active (running) since Mon 2024-06-03 04:23:16 +08; 13h ago
     Docs: https://www.postgresql.org/docs/16/static/
  Process: 283703 ExecStartPre=/usr/pgsql-16/bin/postgresql-16-check-db-dir ${PGDATA} (code=exited, status=0/SUCCESS)
 Main PID: 283709 (postgres)
    Tasks: 9 (limit: 408404)
   Memory: 31.3G (limit: 36.0G)
   CGroup: /system.slice/postgresql-16.service
           ├─283709 /usr/pgsql-16/bin/postgres -D /var/lib/pgsql/16/data/
           ├─283710 postgres: logger
           ├─296122 postgres: checkpointer
           ├─296123 postgres: background writer
           ├─296128 postgres: walwriter
           ├─296129 postgres: autovacuum launcher
           ├─296130 postgres: logical replication launcher
           ├─296133 postgres: db1 db1 127.0.0.1(52340) idle
           └─297725 postgres: db1 db1 127.0.0.1(33216) idle

top - 18:09:44 up 14 days,  6:37,  1 user,  load average: 0.01, 0.07, 0.08
Tasks: 394 total,   2 running, 392 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.4 us,  0.7 sy,  0.0 ni, 98.5 id,  0.0 wa,  0.0 hi,  0.4 si,  0.0 st
MiB Mem :  63873.4 total,   3798.3 free,  22165.3 used,  37909.8 buff/cache
MiB Swap:  32260.0 total,  31045.3 free,   1214.7 used.  32600.4 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                                                                                          
 297728 postgres  20   0  264520   4844   3808 R   6.2   0.0   0:00.02 top -u postgres -n 1 -c                                                                                                                                                          
 283709 postgres  20   0 8960628 198604 197992 S   0.0   0.3   1:23.66 /usr/pgsql-16/bin/postgres -D /var/lib/pgsql/16/data/                                                                                                                            
 283710 postgres  20   0  355288   4808   4576 S   0.0   0.0   0:04.43 postgres: logger                                                                                                                                                                  
 294897 postgres  20   0  226848   5936   3296 S   0.0   0.0   0:00.18 -bash                                                                                                                                                                            
 296122 postgres  20   0 8960912   2.5g   2.5g S   0.0   4.0   0:03.75 postgres: checkpointer                                                                                                                                                            
 296123 postgres  20   0 8960764 161884 161160 S   0.0   0.2   0:00.41 postgres: background writer                                                                                                                                                      
 296128 postgres  20   0 8960764  22404  21748 S   0.0   0.0   0:00.49 postgres: walwriter                                                                                                                                                              
 296129 postgres  20   0 8962216   7800   6872 S   0.0   0.0   0:00.00 postgres: autovacuum launcher                                                                                                                                                    
 296130 postgres  20   0 8962232   7784   6884 S   0.0   0.0   0:00.00 postgres: logical replication launcher                                                                                                                                            
 296133 postgres  20   0 8964788   2.2g   2.2g S   0.0   3.6   1:29.51 postgres: db1 db1 127.0.0.1(52340) idle                                                                                                                                    
 297725 postgres  20   0 8962736  17620  16076 S   0.0   0.0   0:00.00 postgres: db1 db1 127.0.0.1(33216) idle      

Does anyone else see higher than expected memory usage reported by systemctl status?
Do you have any idea how to further debug the problem? 

Best Regards,
Radu 

Re: Postgresql 16.3 Out Of Memory

From: Greg Sabino Mullane
Date:
On Mon, Jun 3, 2024 at 6:19 AM Radu Radutiu <rradutiu@gmail.com> wrote:
> Do you have any idea how to further debug the problem?

Putting aside the issue of non-reclaimed memory for now, can you show us the actual query? The explain analyze you provided shows it doing an awful lot of joins and then returning 14+ million rows to the client. Surely the client does not need that many rows?

You can probably also prevent OOM by lowering your work_mem and/or your max_parallel_workers_per_gather.
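
For example (an illustrative sketch, not tested against this workload — each Parallel Hash node may use roughly work_mem * hash_mem_multiplier per participating process, so both knobs lower the ceiling):

SET work_mem = '32MB';
SET max_parallel_workers_per_gather = 2;
-- then re-run the query, e.g. EXPLAIN (ANALYZE, BUFFERS) <problem query>;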

Cheers,
Greg

Re: Postgresql 16.3 Out Of Memory

From: Ron Johnson
Date:
On Mon, Jun 3, 2024 at 9:12 AM Greg Sabino Mullane <htamfids@gmail.com> wrote:
> On Mon, Jun 3, 2024 at 6:19 AM Radu Radutiu <rradutiu@gmail.com> wrote:
>> Do you have any idea how to further debug the problem?
>
> Putting aside the issue of non-reclaimed memory for now, can you show us the actual query? The explain analyze you provided shows it doing an awful lot of joins and then returning 14+ million rows to the client. Surely the client does not need that many rows?

And the query cost is really high.  "Did you ANALYZE the instance after conversion?" was my first question.

Re: Postgresql 16.3 Out Of Memory

From: Radu Radutiu
Date:
It looks like I did not copy the list. 

I did run VACUUM ANALYZE after the upgrade. I've even run it again just now, before capturing a new explain plan, which is very similar (only the costs differ):
 Gather Merge  (cost=12336145.92..16111570.23 rows=31531596 width=66)
   Workers Planned: 4
   ->  Sort  (cost=12335145.86..12354853.11 rows=7882899 width=66)
         Sort Key: t.msg_status DESC, t.input_sequence
         ->  Parallel Hash Right Join  (cost=9376528.66..11242773.26 rows=7882899 width=66)
               Hash Cond: (snd_tro.reply_input_sequence = t.input_sequence)
               ->  Parallel Seq Scan on tbl_outputrequest snd_tro  (cost=0.00..1431919.45 rows=20057645 width=16)
               ->  Parallel Hash  (cost=9254599.76..9254599.76 rows=5245992 width=58)
                     ->  Parallel Hash Right Join  (cost=7458636.44..9254599.76 rows=5245992 width=58)
                           Hash Cond: (rec_tro.input_sequence = t.input_sequence)
                           ->  Parallel Seq Scan on tbl_outputrequest rec_tro  (cost=0.00..1431919.45 rows=20057645 width=16)
                           ->  Parallel Hash  (cost=7380902.99..7380902.99 rows=3491156 width=50)
                                 ->  Parallel Hash Right Join  (cost=5592677.17..7380902.99 rows=3491156 width=50)
                                       Hash Cond: (rpl_snd_tro.reply_input_sequence = r.input_sequence)
                                       ->  Parallel Seq Scan on tbl_outputrequest rpl_snd_tro  (cost=0.00..1431919.45 rows=20057645 width=16)
                                       ->  Parallel Hash  (cost=5518353.72..5518353.72 rows=3491156 width=42)
                                             ->  Parallel Hash Right Join  (cost=3729209.40..5518353.72 rows=3491156 width=42)
                                                   Hash Cond: (rpl_rec_tro.input_sequence = r.input_sequence)
                                                   ->  Parallel Seq Scan on tbl_outputrequest rpl_rec_tro  (cost=0.00..1431919.45 rows=20057645 width=16)
                                                   ->  Parallel Hash  (cost=3658294.95..3658294.95 rows=3491156 width=34)
                                                         ->  Parallel Hash Right Join  (cost=1883503.35..3658294.95 rows=3491156 width=34)
                                                               Hash Cond: (r.originalrequest_id = t.input_sequence)
                                                               ->  Parallel Seq Scan on tbl_inputrequest r  (cost=0.00..1739752.66 rows=13348166 width=16)
                                                               ->  Parallel Hash  (cost=1839863.91..1839863.91 rows=3491156 width=26)
                                                                     ->  Parallel Seq Scan on tbl_inputrequest t  (cost=0.00..1839863.91 rows=3491156 width=26)
                                                                           Filter: ((receive_time < '2024-05-17 00:00:00'::timestamp without time zone) AND (input_sequence < '202406020168279904'::bigint) AND ((msg_status)::text = ANY ('{COMPLETED,REJECTED}'::text[])))

The query cost is high and it returns a significant number of rows. However, it should not consume 64+ GB of RAM with the default enable_parallel_hash = 'on' when shared_buffers is 8 GB. The temporary fix I've implemented to get the system working is a change in the application configuration so that the timestamp filter selects fewer rows (about 3x fewer), together with setting enable_parallel_hash = 'off'. Even with this setting, the PostgreSQL service's memory usage grows and stays above 20 GB. I'd like to find out exactly what causes the high memory usage, as we have other projects using PostgreSQL that are scheduled for an upgrade from v12.
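
(For reference, a sketch of how the planner setting can be changed instance-wide without a restart — not necessarily how it was applied here:

ALTER SYSTEM SET enable_parallel_hash = off;
SELECT pg_reload_conf();

Existing sessions pick up the new value after the reload unless they have set it locally.)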

My test.sql looks like this (the application uses prepared statements; the two SET statements revert, for the current session, the fixes already implemented):

PREPARE my_query (timestamp, bigint) AS
SELECT t.input_sequence, rec_tro.output_sequence, r.input_sequence,
       rpl_rec_tro.output_sequence, rpl_snd_tro.output_sequence,
       snd_tro.output_sequence, t.msg_type
FROM inputrequest t
LEFT JOIN outputrequest rec_tro     ON rec_tro.input_sequence = t.input_sequence
LEFT JOIN inputrequest  r           ON r.originalRequest_id = t.input_sequence
LEFT JOIN outputrequest rpl_rec_tro ON rpl_rec_tro.input_sequence = r.input_sequence
LEFT JOIN outputrequest rpl_snd_tro ON rpl_snd_tro.reply_input_sequence = r.input_sequence
LEFT JOIN outputrequest snd_tro     ON snd_tro.reply_input_sequence = t.input_sequence
WHERE t.receive_time < $1
  AND t.input_sequence < $2
  AND t.msg_status IN ('COMPLETED', 'REJECTED')
ORDER BY t.msg_status DESC, t.input_sequence;

EXPLAIN  EXECUTE my_query('2024-05-17 00:00:00', 202406020168279904);
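
The timed plans earlier in the thread would come from the instrumented form of the same statement, e.g. (sketch, matching the "explain analyze, buffers" runs mentioned above):

EXPLAIN (ANALYZE, BUFFERS) EXECUTE my_query('2024-05-17 00:00:00', 202406020168279904);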

I have an explanation for what I suspected was a memory leak: it seems that systemd's figure includes cached memory; the relevant number is probably rss from /sys/fs/cgroup/memory/system.slice/postgresql-16.service/memory.stat. This still leaves the out-of-memory situation unexplained.
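
To separate resident memory from page cache in the service's cgroup accounting, something like this works (a sketch for this cgroup-v1 layout):

grep -E '^(rss|cache) ' /sys/fs/cgroup/memory/system.slice/postgresql-16.service/memory.stat
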
I've managed to get a dump of the memory contexts for the 5 processes (the EXPLAIN backend + 4 parallel workers) while the query was starting to use significant memory (>10 GB). I've attached it in case anyone familiar with PostgreSQL internals can spot something obviously wrong in the reported numbers.
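
For anyone who wants to reproduce such a dump: since PostgreSQL 14, a superuser can ask any backend to write its memory contexts to the server log with pg_log_backend_memory_contexts() (a sketch, run while the query is active):

SELECT pid, backend_type, pg_log_backend_memory_contexts(pid)
FROM pg_stat_activity
WHERE state = 'active'
  AND backend_type IN ('client backend', 'parallel worker');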

Radu

Attachments