Re: Avoiding cartesian product

Поиск
Список
Период
Сортировка
От Szűcs Gábor
Тема Re: Avoiding cartesian product
Дата
Msg-id 43C28854.9000605@gmail.com
обсуждение исходный текст
Ответ на Avoiding cartesian product  ("Virag Saksena" <virag@auptyma.com>)
Список pgsql-performance
Dear Virag,

AFAIK aggregates aren't indexed in postgres (at least not before 8.1, which
indexes min and max, iirc).

Also, I don't think you need to exactly determine the trace_id. Try this one
(OTOH; might be wrong):

select DISTINCT ON (a.trace_id, a.seq_no)    -- See below
   b.gc_minor - a.gc_minor, b.gc_major - a.gc_major
from jam_trace_sys a, jam_trace_sys b
where a.trace_id = 22
   and b.trace_id = a.trace_id
   and b.seq_no > a.seq_no            -- Simply ">" is enough
order by a.trace_id, a.seq_no, b.seq_no;    -- DISTINCT, see below

The trick is that DISTINCT takes the first one in each group (IIRC it is
granted, at least someone told me on one of these lists :) ) so if you order
by the DISTINCT attributes and then by b.seq_no, you'll get the smallest of
appropriate b.seq_no values for each DISTINCT values.

The idea of DISTINCTing by both columns is to make sure the planner finds
the index. (lately I had a similar problem: WHERE a=1 ORDER BY b LIMIT 1
used an index on b, instead of an (a,b) index. Using ORDER BY a,b solved it)

HTH,

--
G.


On 2006.01.04. 5:12, Virag Saksena wrote:
>
> I have a table which stores cumulative values
> I would like to display/chart the deltas between successive data
> collections
>
> If my primary key only increments by 1, I could write a simple query
>
> select b.gc_minor - a.gc_minor, b.gc_major - a.gc_major
>   from jam_trace_sys a, jam_trace_sys b
>  where a.trace_id = 22
>    and b.trace_id = a.trace_id
>    and b.seq_no = a.seq_no + 1
>  order by a.seq_no;
>
> However the difference in sequence number is variable.
> So (in Oracle) I used to extract the next seq_no using a correlated
> sub-query
>
> select b.gc_minor - a.gc_minor, b.gc_major - a.gc_major
> from jam_trace_sys a, jam_trace_sys b
> where a.trace_id = 22
> and (b.trace_id, b.seq_no) =
> (select a.trace_id, min(c.seq_no) from jam_trace_sys c
> where c.trace_id = a.trace_id and c.seq_no > a.seq_no)
>  order by a.seq_no;
>
> For every row in A, The correlated sub-query from C will execute
> With an appropriate index, it will just descend the index Btree
> go one row to the right and return that row (min > :value)
> and join to table B
>
> SELECT STATEMENT
>   SORT ORDER BY
>    TABLE ACCESS BY INDEX ROWID JAM_TRACE_SYS B
>      NESTED LOOPS
>        TABLE ACCESS BY INDEX ROWID JAM_TRACE_SYS  A
>          INDEX RANGE SCAN JAM_TRACE_SYS_N1  A
>        INDEX RANGE SCAN JAM_TRACE_SYS_N1 B
>          SORT AGGREGATE
>            INDEX RANGE SCAN JAM_TRACE_SYS_N1 C
>
> In postgreSQL A and B are doing a cartesian product
> then C gets executed for every row in this cartesian product
> and most of the extra rows get thrown out.
> Is there any way to force an execution plan like above where the
> correlated subquery runs before going to B.
> The table is small right now, but it will grow to have millions of rows
> QUERY PLAN
>
-----------------------------------------------------------------------------------------------------------------------------------
>  Sort  (cost=124911.81..124944.84 rows=13213 width=20) (actual
> time=13096.754..13097.053 rows=149 loops=1)
>    Sort Key: a.seq_no
>    ->  Nested Loop  (cost=4.34..124007.40 rows=13213 width=20) (actual
> time=1948.300..13096.329 rows=149 loops=1)
>          Join Filter: (subplan)
>          ->  Seq Scan on jam_trace_sys b  (cost=0.00..3.75 rows=175
> width=16) (actual time=0.005..0.534 rows=175 loops=1)
>          ->  Materialize  (cost=4.34..5.85 rows=151 width=16) (actual
> time=0.002..0.324 rows=150 loops=175)
>                ->  Seq Scan on jam_trace_sys a  (cost=0.00..4.19
> rows=151 width=16) (actual time=0.022..0.687 rows=150 loops=1)
>                      Filter: (trace_id = 22)
>          SubPlan
>            ->  Aggregate  (cost=4.67..4.67 rows=1 width=4) (actual
> time=0.486..0.488 rows=1 loops=26250)
>                  ->  Seq Scan on jam_trace_sys c  (cost=0.00..4.62
> rows=15 width=4) (actual time=0.058..0.311 rows=74 loops=26250)
>                        Filter: ((trace_id = $0) AND (seq_no > $1))
>  Total runtime: 13097.557 ms
> (13 rows)
>
> pglnx01=> \d jam_trace_sys
>      Table "public.jam_trace_sys"
>      Column      |  Type   | Modifiers
> -----------------+---------+-----------
>  trace_id        | integer |
>  seq_no          | integer |
>  cpu_utilization | integer |
>  gc_minor        | integer |
>  gc_major        | integer |
>  heap_used       | integer |
> Indexes:
>     "jam_trace_sys_n1" btree (trace_id, seq_no)
>
> pglnx01=> select count(*) from jam_trace_Sys ;
>  count
> -------
>    175
> (1 row)
>
> pglnx01=> select trace_id, count(*) from jam_trace_sys group by trace_id ;
>  trace_id | count
> ----------+-------
>        15 |     2
>        18 |    21
>        22 |   150
>        16 |     2
> (4 rows)

В списке pgsql-performance по дате отправления:

Предыдущее
От: Harry Jackson
Дата:
Сообщение: Re: help tuning queries on large database
Следующее
От: Alessandro Baretta
Дата:
Сообщение: Re: 500x speed-down: Wrong query plan?