Thread: Limit memory usage by postgres_fdw batches

Limit memory usage by postgres_fdw batches

From:
Alexander Pyhalov
Date:
Hi.

We had some real cases where a client set a rather big batch_size at the
server level, but for some foreign tables, containing large documents, it
was inadequate and led to OOM killer intervention. You can argue that
batch_size can be set at the foreign table level, but that can still be not
flexible enough when tuple size varies. I suppose the same applies to
fetch_size. The issue here is that we can't limit the size of the data (as
opposed to the number of rows) while fetching from a cursor. But we can use
a tuplestore to hold the fetched results, so that they spill out to disk.

I'm attaching two patches which try to fix the issues with possibly huge
memory usage by postgres_fdw batches. For fetched tuples we still can't use
only a tuplestore, as ctids are not preserved in it, so we have to store
them separately.
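
To illustrate the idea, here is a minimal sketch using the existing
tuplestore API (the wrapper functions are made-up names, not what the patch
actually does):

/* Illustration only: buffer fetched rows in a tuplestore so that they
 * spill to disk once they exceed work_mem. */
#include "postgres.h"

#include "executor/tuptable.h"
#include "miscadmin.h"
#include "utils/tuplestore.h"

static Tuplestorestate *fetched_rows = NULL;

/* made-up wrapper: stash one fetched row */
static void
store_fetched_row(TupleTableSlot *slot)
{
    if (fetched_rows == NULL)
        fetched_rows = tuplestore_begin_heap(false,     /* no random access */
                                             false,     /* not cross-transaction */
                                             work_mem); /* spill threshold, in kB */

    /* the tuplestore does not keep ctids, so those are stored separately */
    tuplestore_puttupleslot(fetched_rows, slot);
}

/* made-up wrapper: read back the next buffered row, false when exhausted */
static bool
next_fetched_row(TupleTableSlot *slot)
{
    return tuplestore_gettupleslot(fetched_rows, true, false, slot);
}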

The reproducer for insert is simple.

create extension postgres_fdw;
create server loopback foreign data wrapper postgres_fdw
    options (dbname 'postgres', port '5432', batch_size '100', fetch_size '100');
create table base_table(i int, s bytea);
create foreign table foreign_table (i int, s bytea) server loopback
    options (table_name 'base_table');
create user mapping for public server loopback;

insert into foreign_table select i, 
pg_read_binary_file('/some/big/file') from generate_series(1,1000) i;

will easily grow backend RSS to several gigabytes.
The first patch fixes this problem.

The second patch alleviates the second issue - SELECT * queries can also 
grow backend memory to several GB. Memory usage can still peak (in my toy 
examples) at 3-4 GB, but at least that seems 1-2 GB less than the 
non-patched version.

-- 
Best regards,
Alexander Pyhalov,
Postgres Professional
Attachments

Re: Limit memory usage by postgres_fdw batches

From:
Tomas Vondra
Date:
Hi Alexander,

On 12/26/25 11:54, Alexander Pyhalov wrote:
> Hi.
> 
> We had some real cases where a client set a rather big batch_size at the
> server level, but for some foreign tables, containing large documents, it
> was inadequate and led to OOM killer intervention. You can argue that
> batch_size can be set at the foreign table level, but that can still be
> not flexible enough when tuple size varies.

I agree this can be an issue with large tuples. It'd be good to consider
the size, not just the number of tuples.

I think you modified the right places, but there are three issues we
should improve:

1) It calls estimate_batch_length() for every ExecInsert() call, i.e.
for every tuple. And it walks all tuples up to that point, which makes
it O(N^2). I haven't measured how significant it is, but AFAICS we could
track the current size of the batch fairly easily, and use that.

2) work_mem is in kilobytes, while batch_len is in bytes, so the
comparison (batch_len > work_mem) is not quite right. It'll probably fire
every time, preventing any batching.

3) Shouldn't this also consider the size of the new tuple, not just the
current batch_len? Imagine each tuple is 99% of the work_mem limit. We add
the first one. When adding the next one we check that the current batch is
below work_mem, and so we proceed to add the second tuple. Now the batch is
198% of the limit.

I think it should work like this (rough sketch below):

1) batch_len + tup_len < work_mem => add tuple to batch
2) tup_len < work_mem => flush batch, add tuple to batch
3) tup_len >= work_mem => flush batch, insert tuple directly
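
As a minimal standalone sketch of that decision (the names are made up,
and note that work_mem has to be converted from kB to bytes):

#include <stddef.h>

/* Illustration only, not code from the patch. */
typedef enum
{
    BATCH_ADD,                  /* 1) add tuple to the current batch */
    BATCH_FLUSH_THEN_ADD,       /* 2) flush batch, add tuple to a new batch */
    BATCH_FLUSH_THEN_DIRECT     /* 3) flush batch, insert tuple directly */
} BatchAction;

static BatchAction
choose_batch_action(size_t batch_len, size_t tup_len, int work_mem_kb)
{
    size_t  limit = (size_t) work_mem_kb * 1024;    /* work_mem GUC is in kB */

    if (batch_len + tup_len < limit)
        return BATCH_ADD;
    if (tup_len < limit)
        return BATCH_FLUSH_THEN_ADD;
    return BATCH_FLUSH_THEN_DIRECT;
}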

What bothers me a little bit is that this is per relation. AFAICS when
inserting into a partitioned table with multiple foreign partitions,
each partition will have a separate limit. I wonder if we could do
better and have some sort of "global" limit for the whole insert.

But that's not the fault of this patch, of course.

> I suppose the same applies to fetch_size. The issue here is that we can't
> limit the size of the data (as opposed to the number of rows) while
> fetching from a cursor. But we can use a tuplestore to hold the fetched
> results, so that they spill out to disk.
> 

Perhaps. Seems like a separate issue. I haven't looked very closely, but
do we want to use the tuplestore always, or just when the tuples get too
large? It might even be decided batch by batch, I guess. With fetch_size=1
it's hardly useful, right? Is the tuplestore management overhead measurable?

> I'm attaching two patches which try to fix the issues with possibly huge
> memory usage by postgres_fdw batches. For fetched tuples we still can't
> use only a tuplestore, as ctids are not preserved in it, so we have to
> store them separately.
> 
> The reproducer for insert is simple.
> 
> create extension postgres_fdw ;
> create server loopback foreign data wrapper postgres_fdw options (dbname
> 'postgres', port '5432', batch_size '100', fetch_size '100');
> create table base_table(i int, s bytea);
> create foreign table foreign_table (i int, s bytea) server loopback
> options(table_name 'base_table');
> create user mapping for public server loopback ;
> 
> insert into foreign_table select i, pg_read_binary_file('/some/big/
> file') from generate_series(1,1000) i;
> 
> will easily grow backend RSS to several gigabytes.
> The first patch fixes this problem.
> 
> The second patch alleviates the second issue - SELECT * queries can also
> grow backend memory to several GB. Memory usage can still peak (in my toy
> examples) at 3-4 GB, but at least that seems 1-2 GB less than the
> non-patched version.
> 

How large are the tuples? How much higher was this RSS than the
theoretical minimum?


regards

-- 
Tomas Vondra




Re: Limit memory usage by postgres_fdw batches

From:
Alexander Pyhalov
Date:
Tomas Vondra wrote on 2025-12-27 16:18:
> Hi Alexander,
> 
> On 12/26/25 11:54, Alexander Pyhalov wrote:
>> Hi.
>> 
>> We had some real cases where a client set a rather big batch_size at the
>> server level, but for some foreign tables, containing large documents, it
>> was inadequate and led to OOM killer intervention. You can argue that
>> batch_size can be set at the foreign table level, but that can still be
>> not flexible enough when tuple size varies.
> 
> I agree this can be an issue with large tuples. It'd be good to consider
> the size, not just the number of tuples.
> 
> I think you modified the right places, but there are three issues we
> should improve:
> 
> 1) It calls estimate_batch_length() for every ExecInsert() call, i.e.
> for every tuple. And it walks all tuples up to that point, which makes
> it O(N^2). I haven't measured how significant it is, but AFAICS we could
> track the current size of the batch fairly easily, and use that.
> 
> 2) work_mem is in kilobytes, while batch_len is in bytes, so the
> comparison (batch_len > work_mem) is not quite right. It'll probably fire
> every time, preventing any batching.
> 
> 3) Shouldn't this also consider the size of the new tuple, not just the
> current batch_len? Imagine each tuple is 99% of the work_mem limit. We add
> the first one. When adding the next one we check that the current batch
> is below work_mem, and so we proceed to add the second tuple. Now the
> batch is 198% of the limit.
> 
> I think it should work like this:
> 
> 1) batch_len + tup_len < work_mem => add tuple to batch
> 2) tup_len < work_mem => flush batch, add tuple to batch
> 3) tup_len >= work_mem => flush batch, insert tuple directly
> 
> What bothers me a little bit is that this is per relation. AFAICS when
> inserting into a partitioned table with multiple foreign partitions,
> each partition will have a separate limit. I wonder if we could do
> better and have some sort of "global" limit for the whole insert.
> 
> But that's not the fault of this patch, of course.

Hi. Thank you for the feedback.
I've fixed the first two issues in the first patch.
I've tried to implement switching from batched to regular foreign insert
(it's in the second patch). However, I don't like it. It changes the
resultRelInfo->ri_NumSlots interface: now it can be 0 when a BatchInsert
was followed by a plain foreign insert, and ExecPendingInserts() should
handle this. Also, we can no longer keep the fact that BatchInsert() was
performed private to ExecInsert(), as we have to determine whether the
estate->es_insert_pending_result_relations and
estate->es_insert_pending_modifytables lists were modified between
ExecInsert() calls. In theory we could look at
list_member_ptr(estate->es_insert_pending_result_relations,
resultRelInfo), but I don't know if iterating over this list for each
tuple is a good idea (see the sketch below).
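
The per-tuple check I mean would look roughly like this (an illustration
only, not code from the patch):

#include "postgres.h"

#include "nodes/execnodes.h"
#include "nodes/pg_list.h"

/* Illustration only: does this relation already have a pending batched insert? */
static bool
has_pending_batched_insert(EState *estate, ResultRelInfo *resultRelInfo)
{
    return list_member_ptr(estate->es_insert_pending_result_relations,
                           resultRelInfo);
}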

> 
>> I suppose the same applies to fetch_size. The issue here is that we
>> can't limit the size of the data (as opposed to the number of rows)
>> while fetching from a cursor. But we can use a tuplestore to hold the
>> fetched results, so that they spill out to disk.
>> 
> 
> Perhaps. Seems like a separate issue. I haven't looked very closely, but
> do we want to use the tuplestore always, or just when the tuples get too
> large? It might even be decided batch by batch, I guess. With fetch_size=1
> it's hardly useful, right? Is the tuplestore management overhead
> measurable?

I haven't noticed a significant impact on simple pgbench tests. I will
check different cases with fetch_size = 1. For now I don't understand
whether we could trigger OOM, for example, with an async foreign scan of a
partitioned table with many foreign partitions - currently we would have to
keep each fetched tuple in memory, so perhaps saving it to the tuplestore
until it's needed by Append is not such a bad idea.

> 
>> I'm attaching two patches which try to fix the issues with possibly huge
>> memory usage by postgres_fdw batches. For fetched tuples we still can't
>> use only a tuplestore, as ctids are not preserved in it, so we have to
>> store them separately.
>> 
>> The reproducer for insert is simple.
>> 
>> create extension postgres_fdw ;
>> create server loopback foreign data wrapper postgres_fdw options 
>> (dbname
>> 'postgres', port '5432', batch_size '100', fetch_size '100');
>> create table base_table(i int, s bytea);
>> create foreign table foreign_table (i int, s bytea) server loopback
>> options(table_name 'base_table');
>> create user mapping for public server loopback ;
>> 
>> insert into foreign_table select i, pg_read_binary_file('/some/big/
>> file') from generate_series(1,1000) i;
>> 
>> will easily grow backend RSS to several gigabytes.
>> The first patch fixes this problem.
>> 
>> The second patch alleviates the second issue - SELECT * queries can also
>> grow backend memory to several GB. Memory usage can still peak (in my toy
>> examples) at 3-4 GB, but at least that seems 1-2 GB less than the
>> non-patched version.
>> 
> 
> How large are the tuples? How much higher was this RSS than the
> theoretical minimum?
> 

Sorry, I've lost the exact reproducer. Memory consumption of course
depends on tuple size and fetch size.

I redid some simple tests.


select * from pg_foreign_server ;
  oid  | srvname  | srvowner | srvfdw | srvtype | srvversion | srvacl |                         srvoptions
-------+----------+----------+--------+---------+------------+--------+-----------------------------------------------------------
 16403 | loopback |       10 |  16392 |         |            |        | {dbname=postgres,port=5432,batch_size=100,fetch_size=100}
(1 row)

select count(*) from foreign_table ;
  count
-------
   1100
(1 row)

\d+ base_table
                                   Table "public.base_table"
 Column |  Type   | Collation | Nullable | Default | Storage  | Compression | Stats target | Description
--------+---------+-----------+----------+---------+----------+-------------+--------------+-------------
 i      | integer |           |          |         | plain    |             |              |
 s      | bytea   |           |          |         | extended |             |              |
Access method: heap

select pg_size_pretty(pg_total_relation_size('base_table')) ;
  pg_size_pretty
----------------
  5495 MB
(1 row)

select pg_size_pretty(pg_total_relation_size('base_table')/1100) ;
  pg_size_pretty
----------------
  5116 kB
(1 row)

A backend executing

select i, length(s) from foreign_table;

takes up to 1.4 GB RSS on master and up to 1 GB on the patched version.

P.S. Going to be on vacation for the first half of January.
-- 
Best regards,
Alexander Pyhalov,
Postgres Professional
Attachments