Обсуждение: FDW connection drops with "Connection timed out" during async append query due to TCP receive buffer filling up

Поиск
Список
Период
Сортировка
Hello:

I encountered an issue with foreign table asynchronous scans.

I'm testing on PostgreSQL 17 in a Windows WSL environment, with TCP keepalive settings configured as follows:
configure:
tcp_keepalives_idle = 2
tcp_keepalives_interval = 2
tcp_keepalives_count = 3
tcp_user_timeout = 9

The foreign table was created with:
sql:
CREATE TABLE ftab (a int, data varchar);
I inserted test data using:
sql:
INSERT INTO ftab
SELECT generate_series(1, 10000),
       substring(md5(random()::text) || repeat(md5(random()::text), 127), 1, 40960);
Then I executed an async append query that unions with a simple query:
sql:
SELECT f1(), 'local-data'
UNION ALL
SELECT * FROM f_ftab
LIMIT 2;
Note: f1() is a function that sleeps for 10 seconds, created by:
sql
CREATE OR REPLACE FUNCTION f1()
RETURNS integer
LANGUAGE plpgsql AS $$
BEGIN
    PERFORM pg_sleep(10);
    RETURN 1;
END; $$;
The abnormal behavior was reproduced: the foreign data server connection was disconnected, with the following error logs:
text
2026-03-10 17:52:55.969 CST [820] LOG: statement: FETCH 100 FROM c1
2026-03-10 17:52:56.495 CST [820] LOG: could not receive data from client: Connection timed out
2026-03-10 17:52:56.495 CST [820] LOG: unexpected EOF on client connection with an open transaction
The async foreign table query failed with the error:
text
ERROR: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
invalid socket
CONTEXT: remote SQL command: SELECT a, data FROM public.ftab
WARNING: no connection to the server
This is a minimal working example. In practice, if the local table scan takes too long and the foreign table has sufficiently wide rows, this issue may reproduce.
In my understanding, when performing a local sequential scan, the PostgreSQL backend fetches data from the local plan without fetching any data from the FDW. As a result, the TCP receive buffer may become full, causing the FDW connection to be disconnected.
I believe this is a minor issue. How can I resolve this problem?
On Tue, 2026-03-10 at 18:17 +0800, jiye wrote:
> I encountered an issue with foreign table asynchronous scans.
>
> I'm testing on PostgreSQL 17 in a Windows WSL environment, with TCP keepalive settings configured as follows:
> configure:
> > tcp_keepalives_idle = 2
> > tcp_keepalives_interval = 2
> > tcp_keepalives_count = 3
> > tcp_user_timeout = 9
> >
> The foreign table was created with:
> sql:
> > CREATE TABLE ftab (a int, data varchar);
> I inserted test data using:
> sql:
> > INSERT INTO ftab
> > SELECT generate_series(1, 10000),
> >        substring(md5(random()::text) || repeat(md5(random()::text), 127), 1, 40960);
> Then I executed an async append query that unions with a simple query:
> sql:
> > SELECT f1(), 'local-data'
> > UNION ALL
> > SELECT * FROM f_ftab
> > LIMIT 2;
> Note: f1() is a function that sleeps for 10 seconds, created by:
> sql
> > CREATE OR REPLACE FUNCTION f1()
> > RETURNS integer
> > LANGUAGE plpgsql AS $$
> > BEGIN
> >     PERFORM pg_sleep(10);
> >     RETURN 1;
> > END; $$;
> The abnormal behavior was reproduced: the foreign data server connection was disconnected, with the following error
logs:
> text
> > 2026-03-10 17:52:55.969 CST [820] LOG: statement: FETCH 100 FROM c1
> > 2026-03-10 17:52:56.495 CST [820] LOG: could not receive data from client: Connection timed out
> > 2026-03-10 17:52:56.495 CST [820] LOG: unexpected EOF on client connection with an open transaction

I am not a network expert, but if you set tcp_user_timeout = 9, why are you surprised if the
connection times out after 9 seconds?

Yours,
Laurenz Albe




 I am not surprise the connection timeout, i just wonder if have some solution about this, because if we do not use asynchronous foreign table scan this sql query is ok.
 I just believe that local table scan too long will enhance the possibility of connection timeout. And our application set tcp_user_timeout parameter in consideration of high availability, so it can not be changed arbitrarily.

 when async fdw scan off, sql can be executed as follow:

postgres=# set enable_async_append TO off;
SET
postgres=#
postgres=# select f1(), *, '1','1','1','1','1' from local_fs union all select 1, * from foreign_sales limit 2;
 f1 | id | sale_date | product | amount | region | ?column? | ?column? | ?column? | ?column? | ?column?
----+----+------------+---------+---------+---------+----------+----------+----------+----------+----------
  1 | 1 | 2026-02-10 | pro | 1000.00 | tianjin | 1 | 1 | 1 | 1 | 1
  1 | 2 | 2026-02-10 | pro | 1000.00 | tianjin | 1 | 1 | 1 | 1 | 1
(2 rows)

2026-03-11 09:08:50.659 CST [861] LOG: duration: 20021.652 ms plan:
        Query Text: select f1(), *, '1','1','1','1','1' from local_fs union all select 1, * from foreign_sales limit 2;
        Limit (cost=0.00..0.54 rows=2 width=189) (actual time=10011.214..20021.544 rows=2 loops=1)
          -> Append (cost=0.00..2789043.10 rows=10240148 width=189) (actual time=10011.212..20021.539 rows=2 loops=1)
                -> Seq Scan on local_fs (cost=0.00..2737726.20 rows=10240120 width=189) (actual time=10011.192..20021.516 rows=2 loops=1)
                -> Foreign Scan on foreign_sales (cost=100.00..116.16 rows=28 width=2858) (never executed)
2026-03-11 09:08:50.659 CST [1324] LOG: statement: COMMIT TRANSACTION

At 2026-03-11 01:16:04, "Laurenz Albe" <laurenz.albe@cybertec.at> wrote:
>On Tue, 2026-03-10 at 18:17 +0800, jiye wrote:
>> I encountered an issue with foreign table asynchronous scans.
>> 
>> I'm testing on PostgreSQL 17 in a Windows WSL environment, with TCP keepalive settings configured as follows:
>> configure:
>> > tcp_keepalives_idle = 2
>> > tcp_keepalives_interval = 2
>> > tcp_keepalives_count = 3
>> > tcp_user_timeout = 9
>> > 
>> The foreign table was created with:
>> sql:
>> > CREATE TABLE ftab (a int, data varchar);
>> I inserted test data using:
>> sql:
>> > INSERT INTO ftab 
>> > SELECT generate_series(1, 10000), 
>> >        substring(md5(random()::text) || repeat(md5(random()::text), 127), 1, 40960);
>> Then I executed an async append query that unions with a simple query:
>> sql:
>> > SELECT f1(), 'local-data' 
>> > UNION ALL 
>> > SELECT * FROM f_ftab 
>> > LIMIT 2;
>> Note: f1() is a function that sleeps for 10 seconds, created by:
>> sql
>> > CREATE OR REPLACE FUNCTION f1() 
>> > RETURNS integer 
>> > LANGUAGE plpgsql AS $$
>> > BEGIN 
>> >     PERFORM pg_sleep(10); 
>> >     RETURN 1; 
>> > END; $$;
>> The abnormal behavior was reproduced: the foreign data server connection was disconnected, with the following error logs:
>> text
>> > 2026-03-10 17:52:55.969 CST [820] LOG: statement: FETCH 100 FROM c1
>> > 2026-03-10 17:52:56.495 CST [820] LOG: could not receive data from client: Connection timed out
>> > 2026-03-10 17:52:56.495 CST [820] LOG: unexpected EOF on client connection with an open transaction
>
>I am not a network expert, but if you set tcp_user_timeout = 9, why are you surprised if the
>connection times out after 9 seconds?
>
>Yours,
>Laurenz Albe
On Tuesday, March 10, 2026, jiye <jiye_sw@126.com> wrote:
This is a minimal working example. In practice, if the local table scan takes too long and the foreign table has sufficiently wide rows, this issue may reproduce.
In my understanding, when performing a local sequential scan, the PostgreSQL backend fetches data from the local plan without fetching any data from the FDW. As a result, the TCP receive buffer may become full, causing the FDW connection to be disconnected.
I believe this is a minor issue. How can I resolve this problem?

Do not establish a timeout that the execution of the query cannot beat.  Or, I think, at least ensure the non-async portion of the query can produce a row within the allotted time so the async node is polled within the timeout.  IIUC, the general loop flow is:  begin append, begin async, poll async, poll non-async, poll async, poll non-async, etc…. There will usually be some lag between async polls.  The tcp timeout has to be large enough to accommodate your reality.  No different than if you used a statement timeout.

I don’t actually know whether or if “buffer filling up” is accurate or relevant here.  It doesn’t seem that way.  You haven’t demonstrated that scenario here, just a timeout being reached.


And since the main design point of async is that any of them may be polled at any time it is necessary for all such scans to be initialized before any polling begins.  Starting the clock on all of them.

If you don’t want a connection timeout to happen do not set one.  That’s the resolution here so far as I can tell.

David J.


Sorry, I made a mistake about the tcp_user_timeout configuration. Our app sets it to 9000 (9 seconds), but it still errors out even with 9000 - it just takes a little longer to error.
And about this point : 
  => I don’t actually know whether or if “buffer filling up” is accurate or relevant here.  It doesn’t seem that way.  You haven’t demonstrated that scenario here, just a timeout being reached.
  Actually i have caputured tcp dump firstly, and "tcp buffer filling up" seem to be demonstrated by "TCP windows full" packet."
  Secondly if data of fetch rows are not sufficiently wide, it does not reproduce this issue. 

So i suspect that the reason for this connection timeout is that the tcp buffer is full.

At 2026-03-11 13:30:03, "David G. Johnston" <david.g.johnston@gmail.com> wrote:

On Tuesday, March 10, 2026, jiye <jiye_sw@126.com> wrote:
This is a minimal working example. In practice, if the local table scan takes too long and the foreign table has sufficiently wide rows, this issue may reproduce.
In my understanding, when performing a local sequential scan, the PostgreSQL backend fetches data from the local plan without fetching any data from the FDW. As a result, the TCP receive buffer may become full, causing the FDW connection to be disconnected.
I believe this is a minor issue. How can I resolve this problem?

Do not establish a timeout that the execution of the query cannot beat.  Or, I think, at least ensure the non-async portion of the query can produce a row within the allotted time so the async node is polled within the timeout.  IIUC, the general loop flow is:  begin append, begin async, poll async, poll non-async, poll async, poll non-async, etc…. There will usually be some lag between async polls.  The tcp timeout has to be large enough to accommodate your reality.  No different than if you used a statement timeout.

I don’t actually know whether or if “buffer filling up” is accurate or relevant here.  It doesn’t seem that way.  You haven’t demonstrated that scenario here, just a timeout being reached.


And since the main design point of async is that any of them may be polled at any time it is necessary for all such scans to be initialized before any polling begins.  Starting the clock on all of them.

If you don’t want a connection timeout to happen do not set one.  That’s the resolution here so far as I can tell.

David J.

On Wed, Mar 11, 2026 at 3:25 PM jiye <jiye_sw@126.com> wrote:
> Sorry, I made a mistake about the tcp_user_timeout configuration. Our app sets it to 9000 (9 seconds), but it still
errorsout even with 9000 - it just takes a little longer to error. 
> And about this point :
>   => I don’t actually know whether or if “buffer filling up” is accurate or relevant here.  It doesn’t seem that way.
You haven’t demonstrated that scenario here, just a timeout being reached. 
>   Actually i have caputured tcp dump firstly, and "tcp buffer filling up" seem to be demonstrated by "TCP windows
full"packet." 
>   Secondly if data of fetch rows are not sufficiently wide, it does not reproduce this issue.
>
> So i suspect that the reason for this connection timeout is that the tcp buffer is full.

I think this problem is not with async execution, but with your
environment; if the root cause of it is “TCP windows full”, I think it
might fix it to 1) retrieve only needed columns from the remote server
and 2) decrease the fetch_size option for postgres_fdw.

Best regards,
Etsuro Fujita