Обсуждение: Avoiding cartesian product

Поиск
Список
Период
Сортировка

Avoiding cartesian product

От
"Virag Saksena"
Дата:
I have a table which stores cumulative values
I would like to display/chart the deltas between successive data collections
 
If my primary key only increments by 1, I could write a simple query
 
select b.gc_minor - a.gc_minor, b.gc_major - a.gc_major
  from jam_trace_sys a, jam_trace_sys b
 where a.trace_id = 22
   and b.trace_id = a.trace_id
   and b.seq_no = a.seq_no + 1
 order by a.seq_no;
 
However the difference in sequence number is variable.
So (in Oracle) I used to extract the next seq_no using a correlated sub-query
 
select b.gc_minor - a.gc_minor, b.gc_major - a.gc_major
from jam_trace_sys a, jam_trace_sys b
where a.trace_id = 22
and (b.trace_id, b.seq_no) =
(select a.trace_id, min(c.seq_no) from jam_trace_sys c
where c.trace_id = a.trace_id and c.seq_no > a.seq_no)
 order by a.seq_no;
 
For every row in A, The correlated sub-query from C will execute
With an appropriate index, it will just descend the index Btree
go one row to the right and return that row (min > :value)
and join to table B
 
SELECT STATEMENT
  SORT ORDER BY
   TABLE ACCESS BY INDEX ROWID JAM_TRACE_SYS B
     NESTED LOOPS
       TABLE ACCESS BY INDEX ROWID JAM_TRACE_SYS  A
         INDEX RANGE SCAN JAM_TRACE_SYS_N1  A
       INDEX RANGE SCAN JAM_TRACE_SYS_N1 B
         SORT AGGREGATE
           INDEX RANGE SCAN JAM_TRACE_SYS_N1 C
 
In postgreSQL A and B are doing a cartesian product
then C gets executed for every row in this cartesian product
and most of the extra rows get thrown out.
Is there any way to force an execution plan like above where the correlated subquery runs before going to B.
The table is small right now, but it will grow to have millions of rows
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=124911.81..124944.84 rows=13213 width=20) (actual time=13096.754..13097.053 rows=149 loops=1)
   Sort Key: a.seq_no
   ->  Nested Loop  (cost=4.34..124007.40 rows=13213 width=20) (actual time=1948.300..13096.329 rows=149 loops=1)
         Join Filter: (subplan)
         ->  Seq Scan on jam_trace_sys b  (cost=0.00..3.75 rows=175 width=16) (actual time=0.005..0.534 rows=175 loops=1)
         ->  Materialize  (cost=4.34..5.85 rows=151 width=16) (actual time=0.002..0.324 rows=150 loops=175)
               ->  Seq Scan on jam_trace_sys a  (cost=0.00..4.19 rows=151 width=16) (actual time=0.022..0.687 rows=150 loops=1)
                     Filter: (trace_id = 22)
         SubPlan
           ->  Aggregate  (cost=4.67..4.67 rows=1 width=4) (actual time=0.486..0.488 rows=1 loops=26250)
                 ->  Seq Scan on jam_trace_sys c  (cost=0.00..4.62 rows=15 width=4) (actual time=0.058..0.311 rows=74 loops=26250)
                       Filter: ((trace_id = $0) AND (seq_no > $1))
 Total runtime: 13097.557 ms
(13 rows)
 
pglnx01=> \d jam_trace_sys
     Table "public.jam_trace_sys"
     Column      |  Type   | Modifiers
-----------------+---------+-----------
 trace_id        | integer |
 seq_no          | integer |
 cpu_utilization | integer |
 gc_minor        | integer |
 gc_major        | integer |
 heap_used       | integer |
Indexes:
    "jam_trace_sys_n1" btree (trace_id, seq_no)
 
pglnx01=> select count(*) from jam_trace_Sys ;
 count
-------
   175
(1 row)
 
pglnx01=> select trace_id, count(*) from jam_trace_sys group by trace_id ;
 trace_id | count
----------+-------
       15 |     2
       18 |    21
       22 |   150
       16 |     2
(4 rows)

Re: Avoiding cartesian product

От
Szűcs Gábor
Дата:
Dear Virag,

AFAIK aggregates aren't indexed in postgres (at least not before 8.1, which
indexes min and max, iirc).

Also, I don't think you need to exactly determine the trace_id. Try this one
(OTOH; might be wrong):

select DISTINCT ON (a.trace_id, a.seq_no)    -- See below
   b.gc_minor - a.gc_minor, b.gc_major - a.gc_major
from jam_trace_sys a, jam_trace_sys b
where a.trace_id = 22
   and b.trace_id = a.trace_id
   and b.seq_no > a.seq_no            -- Simply ">" is enough
order by a.trace_id, a.seq_no, b.seq_no;    -- DISTINCT, see below

The trick is that DISTINCT takes the first one in each group (IIRC it is
granted, at least someone told me on one of these lists :) ) so if you order
by the DISTINCT attributes and then by b.seq_no, you'll get the smallest of
appropriate b.seq_no values for each DISTINCT values.

The idea of DISTINCTing by both columns is to make sure the planner finds
the index. (lately I had a similar problem: WHERE a=1 ORDER BY b LIMIT 1
used an index on b, instead of an (a,b) index. Using ORDER BY a,b solved it)

HTH,

--
G.


On 2006.01.04. 5:12, Virag Saksena wrote:
>
> I have a table which stores cumulative values
> I would like to display/chart the deltas between successive data
> collections
>
> If my primary key only increments by 1, I could write a simple query
>
> select b.gc_minor - a.gc_minor, b.gc_major - a.gc_major
>   from jam_trace_sys a, jam_trace_sys b
>  where a.trace_id = 22
>    and b.trace_id = a.trace_id
>    and b.seq_no = a.seq_no + 1
>  order by a.seq_no;
>
> However the difference in sequence number is variable.
> So (in Oracle) I used to extract the next seq_no using a correlated
> sub-query
>
> select b.gc_minor - a.gc_minor, b.gc_major - a.gc_major
> from jam_trace_sys a, jam_trace_sys b
> where a.trace_id = 22
> and (b.trace_id, b.seq_no) =
> (select a.trace_id, min(c.seq_no) from jam_trace_sys c
> where c.trace_id = a.trace_id and c.seq_no > a.seq_no)
>  order by a.seq_no;
>
> For every row in A, The correlated sub-query from C will execute
> With an appropriate index, it will just descend the index Btree
> go one row to the right and return that row (min > :value)
> and join to table B
>
> SELECT STATEMENT
>   SORT ORDER BY
>    TABLE ACCESS BY INDEX ROWID JAM_TRACE_SYS B
>      NESTED LOOPS
>        TABLE ACCESS BY INDEX ROWID JAM_TRACE_SYS  A
>          INDEX RANGE SCAN JAM_TRACE_SYS_N1  A
>        INDEX RANGE SCAN JAM_TRACE_SYS_N1 B
>          SORT AGGREGATE
>            INDEX RANGE SCAN JAM_TRACE_SYS_N1 C
>
> In postgreSQL A and B are doing a cartesian product
> then C gets executed for every row in this cartesian product
> and most of the extra rows get thrown out.
> Is there any way to force an execution plan like above where the
> correlated subquery runs before going to B.
> The table is small right now, but it will grow to have millions of rows
> QUERY PLAN
>
-----------------------------------------------------------------------------------------------------------------------------------
>  Sort  (cost=124911.81..124944.84 rows=13213 width=20) (actual
> time=13096.754..13097.053 rows=149 loops=1)
>    Sort Key: a.seq_no
>    ->  Nested Loop  (cost=4.34..124007.40 rows=13213 width=20) (actual
> time=1948.300..13096.329 rows=149 loops=1)
>          Join Filter: (subplan)
>          ->  Seq Scan on jam_trace_sys b  (cost=0.00..3.75 rows=175
> width=16) (actual time=0.005..0.534 rows=175 loops=1)
>          ->  Materialize  (cost=4.34..5.85 rows=151 width=16) (actual
> time=0.002..0.324 rows=150 loops=175)
>                ->  Seq Scan on jam_trace_sys a  (cost=0.00..4.19
> rows=151 width=16) (actual time=0.022..0.687 rows=150 loops=1)
>                      Filter: (trace_id = 22)
>          SubPlan
>            ->  Aggregate  (cost=4.67..4.67 rows=1 width=4) (actual
> time=0.486..0.488 rows=1 loops=26250)
>                  ->  Seq Scan on jam_trace_sys c  (cost=0.00..4.62
> rows=15 width=4) (actual time=0.058..0.311 rows=74 loops=26250)
>                        Filter: ((trace_id = $0) AND (seq_no > $1))
>  Total runtime: 13097.557 ms
> (13 rows)
>
> pglnx01=> \d jam_trace_sys
>      Table "public.jam_trace_sys"
>      Column      |  Type   | Modifiers
> -----------------+---------+-----------
>  trace_id        | integer |
>  seq_no          | integer |
>  cpu_utilization | integer |
>  gc_minor        | integer |
>  gc_major        | integer |
>  heap_used       | integer |
> Indexes:
>     "jam_trace_sys_n1" btree (trace_id, seq_no)
>
> pglnx01=> select count(*) from jam_trace_Sys ;
>  count
> -------
>    175
> (1 row)
>
> pglnx01=> select trace_id, count(*) from jam_trace_sys group by trace_id ;
>  trace_id | count
> ----------+-------
>        15 |     2
>        18 |    21
>        22 |   150
>        16 |     2
> (4 rows)

Re: Avoiding cartesian product

От
"Virag Saksena"
Дата:

Szűcs,
    Thanks for your suggestion, I guess there is more than one way to attack the problem.
 
I ended up using a trick with limit to get the next row ...
 
select (b.gc_minor- a.gc_minor), (b.gc_major- a.gc_major)
from jam_trace_sys a join jam_trace_sys b on
(b.seq_no = (select c.seq_no from jam_trace_sys c
 where c.trace_id = a.trace_id and c.seq_no > a.seq_no
 order by c.trace_id, c.seq_no limit 1)
 and b.trace_id = a.trace_id),
jam_tracesnap s1, jam_tracesnap s2
where s1.trace_id = a.trace_id
and s1.seq_no = a.seq_no
and s2.trace_id = b.trace_id
and s2.seq_no = b.seq_no
and a.trace_id = 22
order by a.seq_no;
 
This gave me a nice clean execution plan (there are some extra sources listed, but that is a bug in postgresql) ...
 
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Sort  (cost=11.24..11.25 rows=1 width=20) (actual time=0.040..0.040 rows=0 loops=1)
 Sort Key: a.seq_no
 ->Nested Loop  (cost=0.00..11.23 rows=1 width=20) (actual time=0.028..0.028 rows=0 loops=1)
    Join Filter: ("inner".seq_no = "outer".seq_no)
    ->Nested Loop  (cost=0.00..9.20 rows=1 width=32) (actual time=0.024..0.024 rows=0 loops=1)
       Join Filter: ("inner".seq_no = "outer".seq_no)
       ->Nested Loop  (cost=0.00..7.17 rows=1 width=32) (actual time=0.020..0.020 rows=0 loops=1)
          Join Filter: ("inner".seq_no = (subplan))
          ->Index Scan using jam_trace_sys_n1 on jam_trace_sys a  (cost=0.00..3.41 rows=1 width=16) (actual time=0.016..0.016 rows=0 loops=1)
             Index Cond: (trace_id = 22)
          ->Index Scan using jam_trace_sys_n1 on jam_trace_sys b  (cost=0.00..3.41 rows=1 width=16) (never executed)
             Index Cond: (22 = trace_id)
          SubPlan
          ->Limit  (cost=0.00..0.33 rows=1 width=8) (never executed)
            ->Index Scan using jam_trace_sys_n1 on jam_trace_sys c  (cost=0.00..6.36 rows=19 width=8) (never executed)
               Index Cond: ((trace_id = $0) AND (seq_no > $1))
       ->Index Scan using jam_tracesnap_n1 on jam_tracesnap s1  (cost=0.00..2.01 rows=1 width=8) (never executed)
          Index Cond: (22 = trace_id)
    ->Index Scan using jam_tracesnap_n1 on jam_tracesnap s2  (cost=0.00..2.01 rows=1 width=8) (never executed)
       Index Cond: (22 = trace_id)
Regards,
 
Virag

----- Original Message -----
From: "Szűcs Gábor" <surrano@gmail.com>
Sent: Monday, January 09, 2006 7:59 AM
Subject: Re: Avoiding cartesian product

> Dear Virag,
>
> AFAIK aggregates aren't indexed in postgres (at least not before 8.1, which
> indexes min and max, iirc).
>
> Also, I don't think you need to exactly determine the trace_id. Try this one
> (OTOH; might be wrong):
>
> select DISTINCT ON (a.trace_id, a.seq_no) -- See below
>    b.gc_minor - a.gc_minor, b.gc_major - a.gc_major
> from jam_trace_sys a, jam_trace_sys b
> where a.trace_id = 22
>    and b.trace_id = a.trace_id
>    and b.seq_no > a.seq_no -- Simply ">" is enough
> order by a.trace_id, a.seq_no, b.seq_no; -- DISTINCT, see below
>
> The trick is that DISTINCT takes the first one in each group (IIRC it is
> granted, at least someone told me on one of these lists :) ) so if you order
> by the DISTINCT attributes and then by b.seq_no, you'll get the smallest of
> appropriate b.seq_no values for each DISTINCT values.
>
> The idea of DISTINCTing by both columns is to make sure the planner finds
> the index. (lately I had a similar problem: WHERE a=1 ORDER BY b LIMIT 1
> used an index on b, instead of an (a,b) index. Using ORDER BY a,b solved it)
>
> HTH,
>
> --
> G.
>
>
> On 2006.01.04. 5:12, Virag Saksena wrote:
> >
> > I have a table which stores cumulative values
> > I would like to display/chart the deltas between successive data
> > collections
> > 
> > If my primary key only increments by 1, I could write a simple query
> > 
> > select b.gc_minor - a.gc_minor, b.gc_major - a.gc_major
> >   from jam_trace_sys a, jam_trace_sys b
> >  where a.trace_id = 22
> >    and b.trace_id = a.trace_id
> >    and b.seq_no = a.seq_no + 1
> >  order by a.seq_no;
> > 
> > However the difference in sequence number is variable.
> > So (in Oracle) I used to extract the next seq_no using a correlated
> > sub-query
> > 
> > select b.gc_minor - a.gc_minor, b.gc_major - a.gc_major
> > from jam_trace_sys a, jam_trace_sys b
> > where a.trace_id = 22
> > and (b.trace_id, b.seq_no) =
> > (select a.trace_id, min(c.seq_no) from jam_trace_sys c
> > where c.trace_id = a.trace_id and c.seq_no > a.seq_no)
> >  order by a.seq_no;
> > 
> > For every row in A, The correlated sub-query from C will execute
> > With an appropriate index, it will just descend the index Btree
> > go one row to the right and return that row (min > :value)
> > and join to table B
> > 
> > SELECT STATEMENT
> >   SORT ORDER BY
> >    TABLE ACCESS BY INDEX ROWID JAM_TRACE_SYS B
> >      NESTED LOOPS
> >        TABLE ACCESS BY INDEX ROWID JAM_TRACE_SYS  A
> >          INDEX RANGE SCAN JAM_TRACE_SYS_N1  A
> >        INDEX RANGE SCAN JAM_TRACE_SYS_N1 B
> >          SORT AGGREGATE
> >            INDEX RANGE SCAN JAM_TRACE_SYS_N1 C
> > 
> > In postgreSQL A and B are doing a cartesian product
> > then C gets executed for every row in this cartesian product
> > and most of the extra rows get thrown out.
> > Is there any way to force an execution plan like above where the
> > correlated subquery runs before going to B.
> > The table is small right now, but it will grow to have millions of rows
> > QUERY PLAN
> > -----------------------------------------------------------------------------------------------------------------------------------
> >  Sort  (cost=124911.81..124944.84 rows=13213 width=20) (actual
> > time=13096.754..13097.053 rows=149 loops=1)
> >    Sort Key: a.seq_no
> >    ->  Nested Loop  (cost=4.34..124007.40 rows=13213 width=20) (actual
> > time=1948.300..13096.329 rows=149 loops=1)
> >          Join Filter: (subplan)
> >          ->  Seq Scan on jam_trace_sys b  (cost=0.00..3.75 rows=175
> > width=16) (actual time=0.005..0.534 rows=175 loops=1)
> >          ->  Materialize  (cost=4.34..5.85 rows=151 width=16) (actual
> > time=0.002..0.324 rows=150 loops=175)
> >                ->  Seq Scan on jam_trace_sys a  (cost=0.00..4.19
> > rows=151 width=16) (actual time=0.022..0.687 rows=150 loops=1)
> >                      Filter: (trace_id = 22)
> >          SubPlan
> >            ->  Aggregate  (cost=4.67..4.67 rows=1 width=4) (actual
> > time=0.486..0.488 rows=1 loops=26250)
> >                  ->  Seq Scan on jam_trace_sys c  (cost=0.00..4.62
> > rows=15 width=4) (actual time=0.058..0.311 rows=74 loops=26250)
> >                        Filter: ((trace_id = $0) AND (seq_no > $1))
> >  Total runtime: 13097.557 ms
> > (13 rows)
> > 
> > pglnx01=> \d jam_trace_sys
> >      Table "public.jam_trace_sys"
> >      Column      |  Type   | Modifiers
> > -----------------+---------+-----------
> >  trace_id        | integer |
> >  seq_no          | integer |
> >  cpu_utilization | integer |
> >  gc_minor        | integer |
> >  gc_major        | integer |
> >  heap_used       | integer |
> > Indexes:
> >     "jam_trace_sys_n1" btree (trace_id, seq_no)
> > 
> > pglnx01=> select count(*) from jam_trace_Sys ;
> >  count
> > -------
> >    175
> > (1 row)
> > 
> > pglnx01=> select trace_id, count(*) from jam_trace_sys group by trace_id ;
> >  trace_id | count
> > ----------+-------
> >        15 |     2
> >        18 |    21
> >        22 |   150
> >        16 |     2
> > (4 rows)
>