Re: pull-up subquery if JOIN-ON contains refs to upper-query

Поиск
Список
Период
Сортировка
От Alena Rybakina
Тема Re: pull-up subquery if JOIN-ON contains refs to upper-query
Дата
Msg-id 87a6fcbf-929a-408a-9471-28be3080254a@postgrespro.ru
обсуждение исходный текст
Ответ на Re: pull-up subquery if JOIN-ON contains refs to upper-query  (Ilia Evdokimov <ilya.evdokimov@tantorlabs.com>)
Список pgsql-hackers

Hi, all! I updated the patch and it looks nice. All the problems have been solved.

On 03.04.2025 16:56, Ilia Evdokimov wrote:

On 02.04.2025 19:39, Alena Rybakina wrote:

I see that I need to add a walker that, when traversing the tree, determines whether there are conditions under which pull-up is impossible - the presence of
volatility of functions and other restrictions, and leave the transformation for the var objects that I added before, I described it here.


I have some concerns about pulling up every clause from the subquery with one column. In particular, not every clause is safe or beneficial to pull up: OR-clauses, CASE expressions, nested sublinks could significantly change how the planner estimates the number of rows or applies filters, especially when they are not true join predicates. Pulling them up might lead to worse plans, or even change the semantics in subtle ways. I think before applying such transformations, we should make sure they are not only safe but actually improve the resulting plan.

There may indeed be cases where a query plan without pull-up is worse than with pull-up. 

For example, as shown below, with pull-up we don't need to scan two tables and perform a join, since the subquery returns 0 rows (no matching tuples in the inner sequential scan in a parameterized Nested Loop).
However, this cannot be detected at the current planning stage - we simply don't have that information yet. 

Do you have any ideas on how to solve this problem? So far, the only approach I see is to try an alternative plan but I'm still learning this.

For example:
create table t(x int);
create table t1(x int);
create table t2(x int);

insert into t2 select id from generate_series(20001,30000) as id;
insert into t1 select id from generate_series(10001,20000) as id;
insert into t select id from generate_series(1,10000) as id;
vacuum analyze;
explain analyze select * from t where exists (select * from t1 join t2 on t.x = t1.x);

with my patch:
                                                               QUERY PLAN                                                               
----------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=1500540.00..1500822.50 rows=10000 width=4) (actual time=70694.658..70694.662 rows=0.00 loops=1)
   Hash Cond: (t.x = t1.x)
   Buffers: shared hit=135
   ->  Seq Scan on t  (cost=0.00..145.00 rows=10000 width=4) (actual time=0.009..1.545 rows=10000.00 loops=1)
         Buffers: shared hit=45
   ->  Hash  (cost=1500415.00..1500415.00 rows=10000 width=4) (actual time=70690.524..70690.526 rows=10000.00 loops=1)
         Buckets: 16384  Batches: 1  Memory Usage: 480kB
         Buffers: shared hit=90
         ->  HashAggregate  (cost=1500315.00..1500415.00 rows=10000 width=4) (actual time=70683.143..70686.590 rows=10000.00 loops=1)
               Group Key: t1.x
               Batches: 1  Memory Usage: 793kB
               Buffers: shared hit=90
               ->  Nested Loop  (cost=0.00..1250315.00 rows=100000000 width=4) (actual time=0.019..25650.447 rows=100000000.00 loops=1)
                     Buffers: shared hit=90
                     ->  Seq Scan on t1  (cost=0.00..145.00 rows=10000 width=4) (actual time=0.006..4.931 rows=10000.00 loops=1)
                           Buffers: shared hit=45
                     ->  Materialize  (cost=0.00..195.00 rows=10000 width=0) (actual time=0.000..0.875 rows=10000.00 loops=10000)
                           Storage: Memory  Maximum Storage: 519kB
                           Buffers: shared hit=45
                           ->  Seq Scan on t2  (cost=0.00..145.00 rows=10000 width=0) (actual time=0.007..1.246 rows=10000.00 loops=1)
                                 Buffers: shared hit=45
 Planning:
   Buffers: shared hit=36 read=3
 Planning Time: 0.375 ms
 Execution Time: 70695.154 ms


without my patch:

                                                        QUERY PLAN                                                         
---------------------------------------------------------------------------------------------------------------------------
 Seq Scan on t  (cost=0.00..309.30 rows=5738 width=4) (actual time=68268.562..68268.565 rows=0.00 loops=1)
   Filter: EXISTS(SubPlan 1)
   Rows Removed by Filter: 10000
   Buffers: shared hit=900045
   SubPlan 1
     ->  Nested Loop  (cost=0.00..8524.27 rows=654075 width=0) (actual time=6.823..6.823 rows=0.00 loops=10000)
           Buffers: shared hit=900000
           ->  Seq Scan on t2  (cost=0.00..159.75 rows=11475 width=0) (actual time=0.011..1.660 rows=10000.00 loops=10000)
                 Buffers: shared hit=450000
           ->  Materialize  (cost=0.00..188.72 rows=57 width=0) (actual time=0.000..0.000 rows=0.00 loops=100000000)
                 Storage: Memory  Maximum Storage: 17kB
                 Buffers: shared hit=450000
                 ->  Seq Scan on t1  (cost=0.00..188.44 rows=57 width=0) (actual time=2.403..2.403 rows=0.00 loops=10000)
                       Filter: (t.x = x)
                       Rows Removed by Filter: 10000
                       Buffers: shared hit=450000
 Planning:
   Buffers: shared hit=40 read=16
 Planning Time: 0.487 ms
 Execution Time: 68268.600 ms

-- 
Regards,
Alena Rybakina
Postgres Professional
Вложения

В списке pgsql-hackers по дате отправления: