Thread: Limit memory usage by postgres_fdw batches

Limit memory usage by postgres_fdw batches

From:
Alexander Pyhalov
Date:
Hi.

We had some real cases where a client set a rather big batch_size at the
server level, but for some foreign tables, containing large documents, it
was inadequate and led to OOM killer intervention. You can argue that
batch_size can be set at the foreign table level, but that can still be not
flexible enough when tuple size varies. I suppose the same applies to
fetch_size. The issue here is that we can't limit the size of the data (as
opposed to the number of rows) while fetching from a cursor. But we can use
a tuplestore to hold the fetched results, so that they spill out to disk.

I'm attaching two patches which try to fix the issues with possibly huge
memory usage by postgres_fdw batches. For fetched tuples we still can't use
only a tuplestore, as ctids are not preserved in it, so we have to store
them separately.
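
To illustrate the idea, here is a minimal sketch using the existing
tuplestore API (the wrapper functions are made-up names, not what the patch
actually does):

/* Illustration only: buffer fetched rows in a tuplestore so that they
 * spill to disk once they exceed work_mem. */
#include "postgres.h"

#include "executor/tuptable.h"
#include "miscadmin.h"
#include "utils/tuplestore.h"

static Tuplestorestate *fetched_rows = NULL;

/* made-up wrapper: stash one fetched row */
static void
store_fetched_row(TupleTableSlot *slot)
{
    if (fetched_rows == NULL)
        fetched_rows = tuplestore_begin_heap(false,     /* no random access */
                                             false,     /* not cross-transaction */
                                             work_mem); /* spill threshold, in kB */

    /* the tuplestore does not keep ctids, so those are stored separately */
    tuplestore_puttupleslot(fetched_rows, slot);
}

/* made-up wrapper: read back the next buffered row, false when exhausted */
static bool
next_fetched_row(TupleTableSlot *slot)
{
    return tuplestore_gettupleslot(fetched_rows, true, false, slot);
}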

The reproducer for insert is simple.

create extension postgres_fdw;
create server loopback foreign data wrapper postgres_fdw
    options (dbname 'postgres', port '5432', batch_size '100', fetch_size '100');
create table base_table(i int, s bytea);
create foreign table foreign_table (i int, s bytea) server loopback
    options (table_name 'base_table');
create user mapping for public server loopback;

insert into foreign_table select i, 
pg_read_binary_file('/some/big/file') from generate_series(1,1000) i;

will easily grow backend RSS to several gigabytes.
The first patch fixes this problem.

The second patch alleviates the second issue - SELECT * queries can also 
grow backend memory to several GB. Memory usage can still peak (in my toy 
examples) at 3-4 GB, but at least that seems 1-2 GB less than the 
non-patched version.

-- 
Best regards,
Alexander Pyhalov,
Postgres Professional
Attachments

Re: Limit memory usage by postgres_fdw batches

From:
Tomas Vondra
Date:
Hi Alexander,

On 12/26/25 11:54, Alexander Pyhalov wrote:
> Hi.
> 
> We had some real cases where a client set a rather big batch_size at the
> server level, but for some foreign tables, containing large documents, it
> was inadequate and led to OOM killer intervention. You can argue that
> batch_size can be set at the foreign table level, but that can still be
> not flexible enough when tuple size varies.

I agree this can be an issue with large tuples. It'd be good to consider
the size, not just the number of tuples.

I think you modified the right places, but there are three issues we
should improve:

1) It calls estimate_batch_length() for every ExecInsert() call, i.e.
for every tuple. And it walks all tuples up to that point, which makes
it O(N^2). I haven't measured how significant it is, but AFAICS we could
track the current size of the batch fairly easily, and use that.

2) work_mem is in kilobytes, while batch_len is in bytes, so the
comparison (batch_len > work_mem) is not quite right. It'll probably fire
every time, preventing any batching.

3) Shouldn't this also consider the size of the new tuple, not just the
current batch_len? Imagine each tuple is 99% of the work_mem limit. We add
the first one. When adding the next one we check that the current batch is
below work_mem, and so we proceed to add the second tuple. Now the batch is
198% of the limit.

I think it should work like this (rough sketch below):

1) batch_len + tup_len < work_mem => add tuple to batch
2) tup_len < work_mem => flush batch, add tuple to batch
3) tup_len >= work_mem => flush batch, insert tuple directly
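
As a minimal standalone sketch of that decision (the names are made up,
and note that work_mem has to be converted from kB to bytes):

#include <stddef.h>

/* Illustration only, not code from the patch. */
typedef enum
{
    BATCH_ADD,                  /* 1) add tuple to the current batch */
    BATCH_FLUSH_THEN_ADD,       /* 2) flush batch, add tuple to a new batch */
    BATCH_FLUSH_THEN_DIRECT     /* 3) flush batch, insert tuple directly */
} BatchAction;

static BatchAction
choose_batch_action(size_t batch_len, size_t tup_len, int work_mem_kb)
{
    size_t  limit = (size_t) work_mem_kb * 1024;    /* work_mem GUC is in kB */

    if (batch_len + tup_len < limit)
        return BATCH_ADD;
    if (tup_len < limit)
        return BATCH_FLUSH_THEN_ADD;
    return BATCH_FLUSH_THEN_DIRECT;
}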

What bothers me a little bit is that this is per relation. AFAICS when
inserting into a partitioned table with multiple foreign partitions,
each partition will have a separate limit. I wonder if we could do
better and have some sort of "global" limit for the whole insert.

But that's not the fault of this patch, of course.

> I suppose the same applies to fetch_size. The issue here is that we can't
> limit the size of the data (as opposed to the number of rows) while
> fetching from a cursor. But we can use a tuplestore to hold the fetched
> results, so that they spill out to disk.
> 

Perhaps. Seems like a separate issue. I haven't looked very closely, but
do we want to use the tuplestore always, or just when the tuples get too
large? It might even be decided batch by batch, I guess. With fetch_size=1
it's hardly useful, right? Is the tuplestore management overhead measurable?

> I'm attaching two patches which try to fix the issues with possibly huge
> memory usage by postgres_fdw batches. For fetched tuples we still can't
> use only a tuplestore, as ctids are not preserved in it, so we have to
> store them separately.
> 
> The reproducer for insert is simple.
> 
> create extension postgres_fdw ;
> create server loopback foreign data wrapper postgres_fdw options (dbname
> 'postgres', port '5432', batch_size '100', fetch_size '100');
> create table base_table(i int, s bytea);
> create foreign table foreign_table (i int, s bytea) server loopback
> options(table_name 'base_table');
> create user mapping for public server loopback ;
> 
> insert into foreign_table select i, pg_read_binary_file('/some/big/
> file') from generate_series(1,1000) i;
> 
> will easily grow backend RSS to several gigabytes.
> The first patch fixes this problem.
> 
> The second patch alleviates the second issue - SELECT * queries can also
> grow backend memory to several GB. Memory usage can still peak (in my toy
> examples) at 3-4 GB, but at least that seems 1-2 GB less than the
> non-patched version.
> 

How large are the tuples? How much higher was this RSS than the
theoretical minimum?


regards

-- 
Tomas Vondra




Re: Limit memory usage by postgres_fdw batches

From:
Alexander Pyhalov
Date:
Tomas Vondra wrote on 2025-12-27 16:18:
> Hi Alexander,
> 
> On 12/26/25 11:54, Alexander Pyhalov wrote:
>> Hi.
>> 
>> We had some real cases where a client set a rather big batch_size at the
>> server level, but for some foreign tables, containing large documents, it
>> was inadequate and led to OOM killer intervention. You can argue that
>> batch_size can be set at the foreign table level, but that can still be
>> not flexible enough when tuple size varies.
> 
> I agree this can be an issue with large tuples. It'd be good to consider
> the size, not just the number of tuples.
> 
> I think you modified the right places, but there are three issues we
> should improve:
> 
> 1) It calls estimate_batch_length() for every ExecInsert() call, i.e.
> for every tuple. And it walks all tuples up to that point, which makes
> it O(N^2). I haven't measured how significant it is, but AFAICS we could
> track the current size of the batch fairly easily, and use that.
> 
> 2) work_mem is in kilobytes, while batch_len is in bytes, so the
> comparison (batch_len > work_mem) is not quite right. It'll probably fire
> every time, preventing any batching.
> 
> 3) Shouldn't this also consider the size of the new tuple, not just the
> current batch_len? Imagine each tuple is 99% of the work_mem limit. We add
> the first one. When adding the next one we check that the current batch
> is below work_mem, and so we proceed to add the second tuple. Now the
> batch is 198% of the limit.
> 
> I think it should work like this:
> 
> 1) batch_len + tup_len < work_mem => add tuple to batch
> 2) tup_len < work_mem => flush batch, add tuple to batch
> 3) tup_len >= work_mem => flush batch, insert tuple directly
> 
> What bothers me a little bit is that this is per relation. AFAICS when
> inserting into a partitioned table with multiple foreign partitions,
> each partition will have a separate limit. I wonder if we could do
> better and have some sort of "global" limit for the whole insert.
> 
> But that's not the fault of this patch, of course.

Hi. Thank you for the feedback.
I've fixed the first two issues in the first patch.
I've tried to implement switching from batched to regular foreign insert
(it's in the second patch). However, I don't like it. It changes the
resultRelInfo->ri_NumSlots interface: now it can be 0 when a BatchInsert
was followed by a plain foreign insert, and ExecPendingInserts() should
handle this. Also, we can no longer keep the fact that BatchInsert() was
performed private to ExecInsert(), as we have to determine whether the
estate->es_insert_pending_result_relations and
estate->es_insert_pending_modifytables lists were modified between
ExecInsert() calls. In theory we could look at
list_member_ptr(estate->es_insert_pending_result_relations,
resultRelInfo), but I don't know if iterating over this list for each
tuple is a good idea (see the sketch below).
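
The per-tuple check I mean would look roughly like this (an illustration
only, not code from the patch):

#include "postgres.h"

#include "nodes/execnodes.h"
#include "nodes/pg_list.h"

/* Illustration only: does this relation already have a pending batched insert? */
static bool
has_pending_batched_insert(EState *estate, ResultRelInfo *resultRelInfo)
{
    return list_member_ptr(estate->es_insert_pending_result_relations,
                           resultRelInfo);
}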

> 
>> I suppose the same applies to fetch_size. The issue here is that we
>> can't limit the size of the data (as opposed to the number of rows)
>> while fetching from a cursor. But we can use a tuplestore to hold the
>> fetched results, so that they spill out to disk.
>> 
> 
> Perhaps. Seems like a separate issue. I haven't looked very closely, but
> do we want to use the tuplestore always, or just when the tuples get too
> large? It might even be decided batch by batch, I guess. With fetch_size=1
> it's hardly useful, right? Is the tuplestore management overhead
> measurable?

I haven't noticed a significant impact on simple pgbench tests. I will
check different cases with fetch_size = 1. For now I don't understand
whether we could trigger OOM, for example, with an async foreign scan of a
partitioned table with many foreign partitions - currently we would have to
keep each fetched tuple in memory, so perhaps saving it to the tuplestore
until it's needed by Append is not such a bad idea.

> 
>> I'm attaching two patches which try to fix the issues with possibly huge
>> memory usage by postgres_fdw batches. For fetched tuples we still can't
>> use only a tuplestore, as ctids are not preserved in it, so we have to
>> store them separately.
>> 
>> The reproducer for insert is simple.
>> 
>> create extension postgres_fdw ;
>> create server loopback foreign data wrapper postgres_fdw options 
>> (dbname
>> 'postgres', port '5432', batch_size '100', fetch_size '100');
>> create table base_table(i int, s bytea);
>> create foreign table foreign_table (i int, s bytea) server loopback
>> options(table_name 'base_table');
>> create user mapping for public server loopback ;
>> 
>> insert into foreign_table select i, pg_read_binary_file('/some/big/
>> file') from generate_series(1,1000) i;
>> 
>> will easily grow backend RSS to several gigabytes.
>> The first patch fixes this problem.
>> 
>> The second patch alleviates the second issue - SELECT * queries can also
>> grow backend memory to several GB. Memory usage can still peak (in my toy
>> examples) at 3-4 GB, but at least that seems 1-2 GB less than the
>> non-patched version.
>> 
> 
> How large are the tuples? How much higher was this RSS than the
> theoretical minimum?
> 

Sorry, I've lost the exact reproducer. Memory consumption of course
depends on tuple size and fetch size.

I redid some simple tests.


select * from pg_foreign_server ;
  oid  | srvname  | srvowner | srvfdw | srvtype | srvversion | srvacl |                         srvoptions
-------+----------+----------+--------+---------+------------+--------+-----------------------------------------------------------
 16403 | loopback |       10 |  16392 |         |            |        | {dbname=postgres,port=5432,batch_size=100,fetch_size=100}
(1 row)

select count(*) from foreign_table ;
  count
-------
   1100
(1 row)

\d+ base_table
                                   Table "public.base_table"
 Column |  Type   | Collation | Nullable | Default | Storage  | Compression | Stats target | Description
--------+---------+-----------+----------+---------+----------+-------------+--------------+-------------
 i      | integer |           |          |         | plain    |             |              |
 s      | bytea   |           |          |         | extended |             |              |
Access method: heap

select pg_size_pretty(pg_total_relation_size('base_table')) ;
  pg_size_pretty
----------------
  5495 MB
(1 row)

select pg_size_pretty(pg_total_relation_size('base_table')/1100) ;
  pg_size_pretty
----------------
  5116 kB
(1 row)

A backend executing

select i, length(s) from foreign_table;

takes up to 1.4 GB RSS on master and up to 1 GB on the patched version.

P.S. Going to be on vacation for the first half of January.
-- 
Best regards,
Alexander Pyhalov,
Postgres Professional
Attachments