Discussion: Re: Bypassing cursors in postgres_fdw to enable parallel plans
Rafia Sabih <rafia.pghackers@gmail.com> writes:
> Hi,
>
> At present in postgres_fdw, a query which could use a parallel plan, when fired from the remote end, fails to use the
> parallel plan locally because of the presence of CURSORS. Consider the following example,
...
>
> Now, to overcome this limitation, I have worked on this idea (suggested by my colleague Bernd Helmle) of bypassing the
> cursors.
Do you know why we can't use a parallel plan when a cursor is used? Is it
related to this code in ExecutePlan?
    /*
     * Set up parallel mode if appropriate.
     *
     * Parallel mode only supports complete execution of a plan.  If we've
     * already partially executed it, or if the caller asks us to exit early,
     * we must force the plan to run without parallelism.
     */
    if (queryDesc->already_executed || numberTuples != 0)
        use_parallel_mode = false;
Actually, I can't quite understand that comment either, and I have had this
confusion for a long time.
--
Best Regards
Andy Fan
Hello hackers,
I am back at this work with a rebased and revised patch. The new version also takes a different approach.
Whenever we are using non-cursor mode, we always save the tuples of the first query in a tuplestore,
because we have no means of knowing beforehand how many cursors the query will require.
And once we switch to the next query, we have no way to fetch the remaining tuples of the previous one.
So, the tuples retrieved earlier for the first query would be lost if not saved.
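To make the idea concrete, here is a rough sketch of that buffering step. The tuplestore calls are the real backend API, but fetch_next_remote_row() is a hypothetical helper standing in for however the patch reads rows from the in-progress remote result; none of the names are taken from the patch itself.

#include "postgres.h"
#include "miscadmin.h"			/* work_mem */
#include "executor/tuptable.h"
#include "utils/tuplestore.h"

/* Hypothetical helper: reads the next row of the in-progress remote result
 * into the given slot, returning false once the result is exhausted. */
static bool fetch_next_remote_row(TupleTableSlot *slot);

/*
 * Sketch: before reusing the connection for another query, drain the rest
 * of the current result set into a tuplestore so the rows are not lost.
 */
static Tuplestorestate *
stash_remaining_rows(TupleTableSlot *slot)
{
	/* no random access needed; spills to a temp file beyond work_mem */
	Tuplestorestate *store = tuplestore_begin_heap(false, false, work_mem);

	while (fetch_next_remote_row(slot))
		tuplestore_puttupleslot(store, slot);

	return store;
}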
I would highly appreciate your time and feedback for this.
On Wed, 12 Mar 2025 at 12:57, Andy Fan <zhihuifan1213@163.com> wrote:
> Rafia Sabih <rafia.pghackers@gmail.com> writes:
>
> > Hi,
> >
> > At present in postgres_fdw, a query which could use a parallel plan, when fired from the remote end, fails to use the
> > parallel plan locally because of the presence of CURSORS. Consider the following example,
> ...
> >
> > Now, to overcome this limitation, I have worked on this idea (suggested by my colleague Bernd Helmle) of bypassing the
> > cursors.
>
> Do you know why we can't use a parallel plan when a cursor is used? Is it
> related to this code in ExecutePlan?
>
>     /*
>      * Set up parallel mode if appropriate.
>      *
>      * Parallel mode only supports complete execution of a plan.  If we've
>      * already partially executed it, or if the caller asks us to exit early,
>      * we must force the plan to run without parallelism.
>      */
>     if (queryDesc->already_executed || numberTuples != 0)
>         use_parallel_mode = false;
>
> Actually, I can't quite understand that comment either, and I have had this
> confusion for a long time.
>
> --
> Best Regards
> Andy Fan
Regards,
Rafia Sabih
CYBERTEC PostgreSQL International GmbH
Attachments
On Wed, Mar 12, 2025 at 7:57 AM Andy Fan <zhihuifan1213@163.com> wrote:
> Do you know why we can't use a parallel plan when a cursor is used? Is it
> related to this code in ExecutePlan?

Yes. When a cursor is used, the whole query isn't executed all at once, but rather the executor will be started and stopped for each fetch from the cursor. We can't keep the parallel workers running for that whole time, not just because it would be inefficient, but because it would be incorrect. State changes would be possible in the leader that were not reflected in the workers, leading to chaos.

--
Robert Haas
EDB: http://www.enterprisedb.com
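To illustrate that point: in cursor mode, postgres_fdw drives the remote query in batches, conceptually like the libpq loop sketched below (the cursor name and batch size are merely examples; the real code is more involved). Each FETCH starts and stops the remote executor, so the remote side can never stay in parallel mode for the whole query.

#include <libpq-fe.h>

/* Conceptual sketch of cursor-mode fetching; error handling omitted. */
static void
fetch_in_batches(PGconn *conn)
{
	PGresult   *res;

	PQclear(PQexec(conn, "BEGIN"));
	PQclear(PQexec(conn, "DECLARE c1 CURSOR FOR SELECT count(*) FROM \"S 1\".\"T 1\""));

	for (;;)
	{
		/* each FETCH is a separate executor start/stop on the remote side */
		res = PQexec(conn, "FETCH 100 FROM c1");
		if (PQresultStatus(res) != PGRES_TUPLES_OK || PQntuples(res) == 0)
		{
			PQclear(res);
			break;
		}
		/* ... hand this batch of rows to the local scan ... */
		PQclear(res);
	}

	PQclear(PQexec(conn, "CLOSE c1"));
	PQclear(PQexec(conn, "COMMIT"));
}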
On Mon, Sep 29, 2025 at 10:51 AM Rafia Sabih <rafia.pghackers@gmail.com> wrote:
> I am back at this work with a rebased and revised patch. The new version is rebased and has a change in approach.
> Whenever we are using non-cursor mode, for the first cursor we are always saving the tuples
> in the tuplestore, this is because we do not have any means to know beforehand how many cursors are required for the query.

This might have the advantage of being simpler, but it's definitely worse. If we're only fetching one result set, which will be common, we'll buffer the whole thing in a tuplestore where that could otherwise be avoided. Maybe it's still best to go with this approach; not sure.

> And when we switch to the next query then we do not have a way to fetch the tuples for the previous query.
> So, the tuples retrieved earlier for the first query were lost if not saved.
> I would highly appreciate your time and feedback for this.

My suggestions are to work on the following areas:

1. Automated testing. The patch has no regression tests, and won't get committed without those.

2. Manual testing. How does the performance with this new option compare to the existing method? The answer might be different for small result sets that fit in memory and large ones that spill to disk, and will certainly also depend on how productive parallel query can be on the remote side; but I think you want to do and post on this list some measurements showing the best and worst case for the patch.

3. Patch clean-up. There are plenty of typos and whitespace-only changes in the patch. It's best to clean those up. Running pgindent is a good idea too. Some places could use comments.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Fri, 10 Oct 2025 at 22:02, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Sep 29, 2025 at 10:51 AM Rafia Sabih <rafia.pghackers@gmail.com> wrote:
> > I am back at this work with a rebased and revised patch. The new version is rebased and has a change in approach.
> > Whenever we are using non-cursor mode, for the first cursor we are always saving the tuples
> > in the tuplestore, this is because we do not have any means to know beforehand how many cursors are required for the query.
>
> This might have the advantage of being simpler, but it's definitely
> worse. If we're only fetching one result set, which will be common,
> we'll buffer the whole thing in a tuplestore where that could
> otherwise be avoided. Maybe it's still best to go with this approach;
> not sure.
>
> > And when we switch to the next query then we do not have a way to fetch the tuples for the previous query.
> > So, the tuples retrieved earlier for the first query were lost if not saved.
> > I would highly appreciate your time and feedback for this.
>
> My suggestions are to work on the following areas:
>
> 1. Automated testing. The patch has no regression tests, and won't get
> committed without those.
>
> 2. Manual testing. How does the performance with this new option
> compare to the existing method? The answer might be different for
> small result sets that fit in memory and large ones that spill to
> disk, and will certainly also depend on how productive parallel query
> can be on the remote side; but I think you want to do and post on this
> list some measurements showing the best and worst case for the patch.
>
> 3. Patch clean-up. There are plenty of typos and whitespace-only
> changes in the patch. It's best to clean those up. Running pgindent is
> a good idea too. Some places could use comments.
>
> --
> Robert Haas
> EDB: http://www.enterprisedb.com
Thank you, Robert, for reviewing this patch.

On going through the patch more, I realised it was not equipped to handle cases where there are more than two active cursors. To accommodate such cases, I have now turned the new struct that saves the previous query into a list of such structs. It also turns out that we need not save the tuples while a cursor is still the active one; we only populate the associated tuplestore when we need to create a new cursor while the old cursor is not yet completely done.
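For what it's worth, the bookkeeping might look roughly like the sketch below; all names here are illustrative, not taken from the patch.

#include "postgres.h"
#include "miscadmin.h"			/* work_mem */
#include "nodes/pg_list.h"
#include "utils/tuplestore.h"

/* Illustrative only: one entry per remote query whose result is set aside. */
typedef struct PgFdwSavedQuery
{
	unsigned int query_no;		/* which remote query these tuples belong to */
	Tuplestorestate *tuples;	/* buffered rows; may spill to disk */
	bool		complete;		/* remote result fully read? */
} PgFdwSavedQuery;

/*
 * Append a new entry when another query has to be issued on a connection
 * whose current query is not yet completely done.
 */
static List *
set_aside_query(List *saved_queries, unsigned int query_no)
{
	PgFdwSavedQuery *saved = palloc0(sizeof(PgFdwSavedQuery));

	saved->query_no = query_no;
	saved->tuples = tuplestore_begin_heap(false, false, work_mem);
	saved->complete = false;

	return lappend(saved_queries, saved);
}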
In this version I have added regression tests as well. I wanted to exercise this patch across all the existing postgres_fdw test cases; the only way I could figure out to do that was to rerun the SELECT statements with the new GUC.
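For readers trying the tests: the GUC itself is presumably registered in the extension's _PG_init(), something along the lines sketched below. Only the name postgres_fdw.use_cursor is taken from the tests; the descriptions and default are guesses, not the patch's actual wording.

#include "postgres.h"
#include "utils/guc.h"

static bool use_cursor = true;	/* keep the existing behaviour by default */

void		_PG_init(void);

void
_PG_init(void)
{
	DefineCustomBoolVariable("postgres_fdw.use_cursor",
							 "Fetch remote query results using cursors.",
							 "When off, remote queries are executed directly, "
							 "which lets the remote server consider parallel plans.",
							 &use_cursor,
							 true,
							 PGC_USERSET,
							 0,
							 NULL, NULL, NULL);
}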
I also did some performance tests. I used the contrib_regression database and populated the table "S 1"."T 1" with more tuples to understand the impact of the patch at a larger scale. I also used auto_explain to capture the plans used on the remote side.
contrib_regression=# select count(*) from ft1;
2025-11-14 14:40:35.825 CET [44338] LOG: duration: 61.336 ms plan:
Query Text: select count(*) from ft1;
Foreign Scan (cost=102.50..123.72 rows=1 width=8)
Relations: Aggregate on (ft1)
2025-11-14 14:40:35.825 CET [44862] LOG: duration: 60.575 ms plan:
Query Text: DECLARE c1 CURSOR FOR
SELECT count(*) FROM "S 1"."T 1"
Aggregate (cost=21888.22..21888.23 rows=1 width=8)
-> Seq Scan on "T 1" (cost=0.00..19956.98 rows=772498 width=0)
count
--------
990821
(1 row)
Time: 62.728 ms
contrib_regression=# SET postgres_fdw.use_cursor = false;
SET
Time: 1.515 ms
contrib_regression=# select count(*) from ft1;
2025-11-14 14:40:46.260 CET [44862] LOG: duration: 21.875 ms plan:
Query Text: SELECT count(*) FROM "S 1"."T 1"
Finalize Aggregate (cost=17255.64..17255.65 rows=1 width=8)
-> Gather (cost=17255.43..17255.64 rows=2 width=8)
Workers Planned: 2
-> Partial Aggregate (cost=16255.43..16255.44 rows=1 width=8)
-> Parallel Seq Scan on "T 1" (cost=0.00..15450.74 rows=321874 width=0)
2025-11-14 14:40:46.260 CET [44338] LOG: duration: 22.623 ms plan:
Query Text: select count(*) from ft1;
Foreign Scan (cost=102.50..123.72 rows=1 width=8)
Relations: Aggregate on (ft1)
count
--------
990821
(1 row)
Time: 24.862 ms
So for this query, the advantage comes from the parallel plan on the remote side, which was otherwise not possible in this scenario.
To study the performance of this patch further, I found another interesting query; here are the results:
contrib_regression=# SET postgres_fdw.use_cursor = true;
SET
contrib_regression=# explain (analyse, buffers) SELECT t1."C 1" FROM "S 1"."T 1" t1 left join ft1 t2 join ft2 t3 on (t2.c1 = t3.c1) on (t3.c1 = t1."C 1");
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
 Hash Left Join  (cost=109013.50..131877.35 rows=772498 width=4) (actual time=112311.578..112804.516 rows=990821.00 loops=1)
   Hash Cond: (t1."C 1" = t3.c1)
   Buffers: shared hit=12232, temp read=12754 written=12754
   ->  Seq Scan on "T 1" t1  (cost=0.00..19956.98 rows=772498 width=4) (actual time=0.039..48.808 rows=990821.00 loops=1)
         Buffers: shared hit=12232
   ->  Hash  (cost=109001.00..109001.00 rows=1000 width=4) (actual time=112310.386..112310.387 rows=990821.00 loops=1)
         Buckets: 262144 (originally 1024)  Batches: 8 (originally 1)  Memory Usage: 6408kB
         Buffers: temp written=2537
         ->  Nested Loop  (cost=200.43..109001.00 rows=1000 width=4) (actual time=0.728..112030.241 rows=990821.00 loops=1)
               ->  Foreign Scan on ft1 t2  (cost=100.00..331.00 rows=1000 width=4) (actual time=0.398..710.505 rows=990821.00 loops=1)
               ->  Foreign Scan on ft2 t3  (cost=100.43..108.66 rows=1 width=4) (actual time=0.082..0.082 rows=1.00 loops=990821)
 Planning:
   Buffers: shared hit=5
 Planning Time: 2.211 ms
 Execution Time: 112825.428 ms
(15 rows)
contrib_regression=# SET postgres_fdw.use_cursor = false;
SET
explain (analyse, buffers) SELECT t1."C 1" FROM "S 1"."T 1" t1 left join ft1 t2 join ft2 t3 on (t2.c1 = t3.c1) on (t3.c1 = t1."C 1");
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Hash Left Join  (cost=109013.50..131877.35 rows=772498 width=4) (actual time=261.416..354.520 rows=990821.00 loops=1)
   Hash Cond: (t1."C 1" = t3.c1)
   Buffers: shared hit=12232, temp written=2660
   ->  Seq Scan on "T 1" t1  (cost=0.00..19956.98 rows=772498 width=4) (actual time=0.021..35.531 rows=990821.00 loops=1)
         Buffers: shared hit=12232
   ->  Hash  (cost=109001.00..109001.00 rows=1000 width=4) (actual time=261.381..261.383 rows=100.00 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 12kB
         Buffers: temp written=2660
         ->  Nested Loop  (cost=200.43..109001.00 rows=1000 width=4) (actual time=255.563..261.356 rows=100.00 loops=1)
               Buffers: temp written=2660
               ->  Foreign Scan on ft1 t2  (cost=100.00..331.00 rows=1000 width=4) (actual time=0.433..0.443 rows=100.00 loops=1)
               ->  Foreign Scan on ft2 t3  (cost=100.43..108.66 rows=1 width=4) (actual time=2.609..2.609 rows=1.00 loops=100)
                     Buffers: temp written=2660
 Planning:
   Buffers: shared hit=5
 Planning Time: 2.284 ms
 Execution Time: 389.358 ms
(17 rows)
So even in the case without a parallel plan, it is performing significantly better. I investigated a bit more to find out why the query was so slow with the cursors,
and came to understand that it is repeatedly abandoning and recreating the cursor via the code path of postgresReScanForeignScan.
So it looks like cursor management itself accounts for the extra time.
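The contrast is easy to see: as I understand the existing rescan path, each rescan in cursor mode means another remote command (rewinding the cursor, or closing and re-declaring it when parameters change), whereas once the rows are buffered locally a rescan needs no remote round trip at all. A minimal sketch of the non-cursor case, assuming the rows were stashed in a tuplestore as described above:

#include "postgres.h"
#include "utils/tuplestore.h"

/*
 * Sketch only: rescanning a buffered remote result is purely local,
 * which is where the large timing difference above appears to come from.
 */
static void
rescan_buffered_result(Tuplestorestate *buffered_rows)
{
	tuplestore_rescan(buffered_rows);	/* back to the first stored tuple */
}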
To understand the impact of the patch when tuples spill to disk, I tried the following example:
contrib_regression=# SET postgres_fdw.use_cursor = true;
SET
contrib_regression=# set enable_hashjoin = off;
SET
contrib_regression=# explain (analyse, buffers)select * from ft1 a, ft1 b where a.c1 = b.c1 order by a.c2;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=831.49..833.99 rows=1000 width=94) (actual time=4537.437..4598.483 rows=990821.00 loops=1)
   Sort Key: a.c2
   Sort Method: external merge  Disk: 137768kB
   Buffers: temp read=83156 written=83253
   ->  Merge Join  (cost=761.66..781.66 rows=1000 width=94) (actual time=3748.488..4090.547 rows=990821.00 loops=1)
         Merge Cond: (a.c1 = b.c1)
         Buffers: temp read=48725 written=48779
         ->  Sort  (cost=380.83..383.33 rows=1000 width=47) (actual time=1818.521..1865.792 rows=990821.00 loops=1)
               Sort Key: a.c1
               Sort Method: external merge  Disk: 75664kB
               Buffers: temp read=18910 written=18937
               ->  Foreign Scan on ft1 a  (cost=100.00..331.00 rows=1000 width=47) (actual time=1.302..1568.640 rows=990821.00 loops=1)
         ->  Sort  (cost=380.83..383.33 rows=1000 width=47) (actual time=1929.955..1981.104 rows=990821.00 loops=1)
               Sort Key: b.c1
               Sort Method: external sort  Disk: 79520kB
               Buffers: temp read=29815 written=29842
               ->  Foreign Scan on ft1 b  (cost=100.00..331.00 rows=1000 width=47) (actual time=0.528..1553.249 rows=990821.00 loops=1)
 Planning Time: 0.479 ms
 Execution Time: 4661.872 ms
(19 rows)
contrib_regression=# SET postgres_fdw.use_cursor = false;
SET
contrib_regression=# explain (analyse, buffers)select * from ft1 a, ft1 b where a.c1 = b.c1 order by a.c2;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=831.49..833.99 rows=1000 width=94) (actual time=3376.385..3435.406 rows=990821.00 loops=1)
   Sort Key: a.c2
   Sort Method: external merge  Disk: 137768kB
   Buffers: temp read=83156 written=83253
   ->  Merge Join  (cost=761.66..781.66 rows=1000 width=94) (actual time=2565.517..2916.814 rows=990821.00 loops=1)
         Merge Cond: (a.c1 = b.c1)
         Buffers: temp read=48725 written=48779
         ->  Sort  (cost=380.83..383.33 rows=1000 width=47) (actual time=1249.517..1300.132 rows=990821.00 loops=1)
               Sort Key: a.c1
               Sort Method: external merge  Disk: 75664kB
               Buffers: temp read=18910 written=18937
               ->  Foreign Scan on ft1 a  (cost=100.00..331.00 rows=1000 width=47) (actual time=1.651..980.740 rows=990821.00 loops=1)
         ->  Sort  (cost=380.83..383.33 rows=1000 width=47) (actual time=1315.990..1369.576 rows=990821.00 loops=1)
               Sort Key: b.c1
               Sort Method: external sort  Disk: 79520kB
               Buffers: temp read=29815 written=29842
               ->  Foreign Scan on ft1 b  (cost=100.00..331.00 rows=1000 width=47) (actual time=0.426..970.728 rows=990821.00 loops=1)
 Planning Time: 0.491 ms
 Execution Time: 3527.457 ms
(19 rows)
So it doesn't hurt in this case either. But that is because it is not the tuplestore added by this patch that is spilling to disk here.
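For context on when the patch's own buffer would spill: a tuplestore created with a work_mem budget (as in the sketches above, which is the conventional choice, though I have not verified what the patch uses) keeps rows in memory until that budget is exceeded and only then writes a temp file, so reaching the worst case needs a set-aside result set larger than work_mem. Whether a given store has spilled can be checked like this:

#include "postgres.h"
#include "utils/tuplestore.h"

/* True while the buffered result still fits within its memory budget. */
static bool
buffer_still_in_memory(Tuplestorestate *buffered_rows)
{
	return tuplestore_in_memory(buffered_rows);
}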
I am working on finding a case with two or more active cursors where the tuples stored for one of them spill to disk.
That should give a picture of the worst-case scenario for this patch.
Also, thanks to Kenan, a fellow hacker who finds this work interesting and has offered to do some performance analysis for this patch;
maybe he can also post more results here.
--
Regards,
Rafia Sabih
CYBERTEC PostgreSQL International GmbH