Index usage with sub select or inner joins

От: Julien Theulier
Тема: Index usage with sub select or inner joins
Дата: ,
Msg-id: 003c01c944c9$c1d92ac0$458b8040$@com
(см: обсуждение, исходный текст)
Ответ на: Re: Using index for IS NULL query  (Tomas Vondra)
Ответы: Re: Index usage with sub select or inner joins  (Joshua Tolley)
Список: pgsql-performance

Скрыть дерево обсуждения

Using index for IS NULL query  ("Andrus", )
 Re: Using index for IS NULL query  (Tom Lane, )
  Re: Using index for IS NULL query  (Andreas Kretschmer, )
  Re: Using index for IS NULL query  (Matthew Wakeling, )
 Re: Using index for IS NULL query  (Andreas Kretschmer, )
 Re: Using index for IS NULL query  (Tomas Vondra, )
  Re: Using index for IS NULL query  ("Vladimir Sitnikov", )
   Re: Using index for IS NULL query  ("Scott Marlowe", )
   Re: Using index for IS NULL query  (Tomas Vondra, )
  Index usage with sub select or inner joins  ("Julien Theulier", )
   Re: Index usage with sub select or inner joins  (Joshua Tolley, )
    Re: Index usage with sub select or outer joins  ("Julien Theulier", )

Hello,

I am doing some performances testing on Postgres & I discovered the
following behavior, when using 2 different ways of writing selects (but
doing the same aggregations at the end):
1. test case 1, using outer join:
create table test2 as
select
soj_session_log_id, pv_timestamp, vi_pv_id,a.item_id,
coalesce(sum(case when (bid_date<pv_timestamp and bid_date>=pv_timestamp -
INTERVAL '3 day') then 1 else 0 end)) as recent_sales_3d1,
coalesce(sum(case when (bid_date<pv_timestamp and bid_date>=pv_timestamp -
INTERVAL '7 day') then 1 else 0 end)) as recent_sales_7d1,
coalesce(sum(case when (bid_date<pv_timestamp and bid_date>=pv_timestamp -
INTERVAL '14 day') then 1 else 0 end)) as recent_sales_14d1,
coalesce(sum(case when (bid_date<pv_timestamp and bid_date>=pv_timestamp -
INTERVAL '30 day') then 1 else 0 end)) as recent_sales_30d1,
coalesce(sum(case when (bid_date<pv_timestamp and bid_date>=pv_timestamp -
INTERVAL '60 day') then 1 else 0 end)) as recent_sales_60d1
from bm_us_views_main_1609 a
left outer join bm_us_bids b on (b.item_id=a.item_id and
b.bid_date<a.pv_timestamp and (b.bid_date>=a.pv_timestamp - INTERVAL '60
day'))
where a.item_type in (7,9) and qty>1
group by soj_session_log_id, pv_timestamp, vi_pv_id, a.item_id;;

This query doesn't use any index according to the explain plan:
"HashAggregate  (cost=672109.07..683054.81 rows=182429 width=49)"
"  ->  Merge Left Join  (cost=646489.83..668004.42 rows=182429 width=49)"
"        Merge Cond: (a.item_id = b.item_id)"
"        Join Filter: ((b.bid_date < a.pv_timestamp) AND (b.bid_date >=
(a.pv_timestamp - '60 days'::interval)))"
"        ->  Sort  (cost=331768.62..332224.69 rows=182429 width=41)"
"              Sort Key: a.item_id"
"              ->  Seq Scan on bm_us_views_main_1609 a
(cost=0.00..315827.08 rows=182429 width=41)"
"                    Filter: ((item_type = ANY ('{7,9}'::numeric[])) AND
(qty > 1))"
"        ->  Sort  (cost=314669.01..320949.52 rows=2512205 width=19)"
"              Sort Key: b.item_id"
"              ->  Seq Scan on bm_us_bids b  (cost=0.00..47615.05
rows=2512205 width=19)"

2. Test case 2, using sub queries:
create table test2 as
select
soj_session_log_id, pv_timestamp, vi_pv_id,item_id,
coalesce((select count(*) from bm_us_bids b where b.item_id=a.item_id and
bid_date<pv_timestamp and bid_date>=pv_timestamp - INTERVAL '3 day' group by
item_id ),0) as recent_sales_3d,
coalesce((select count(*) from bm_us_bids b where b.item_id=a.item_id and
bid_date<pv_timestamp and bid_date>=pv_timestamp - INTERVAL '7 day' group by
item_id ),0) as recent_sales_7d,
coalesce((select count(*) from bm_us_bids b where b.item_id=a.item_id and
bid_date<pv_timestamp and bid_date>=pv_timestamp - INTERVAL '14 day' group
by item_id ),0) as recent_sales_14d,
coalesce((select count(*) from bm_us_bids b where b.item_id=a.item_id and
bid_date<pv_timestamp and bid_date>=pv_timestamp - INTERVAL '30 day' group
by item_id ),0) as recent_sales_30d,
coalesce((select count(*) from bm_us_bids b where b.item_id=a.item_id and
bid_date<pv_timestamp and bid_date>=pv_timestamp - INTERVAL '60 day' group
by item_id ),0) as recent_sales_60d
from bm_us_views_main_1609 a
where item_type in (7,9) and qty>1;

This query uses indexes  according to the explain plan:
"Seq Scan on bm_us_views_main_1609 a  (cost=0.00..8720230.77 rows=182429
width=41)"
"  Filter: ((item_type = ANY ('{7,9}'::numeric[])) AND (qty > 1))"
"  SubPlan"
"    ->  GroupAggregate  (cost=0.00..9.21 rows=1 width=11)"
"          ->  Index Scan using bm_us_bids_item_ix on bm_us_bids b
(cost=0.00..9.20 rows=1 width=11)"
"                Index Cond: ((item_id = $0) AND (bid_date < $1) AND
(bid_date >= ($1 - '60 days'::interval)))"
"    ->  GroupAggregate  (cost=0.00..9.21 rows=1 width=11)"
"          ->  Index Scan using bm_us_bids_item_ix on bm_us_bids b
(cost=0.00..9.20 rows=1 width=11)"
"                Index Cond: ((item_id = $0) AND (bid_date < $1) AND
(bid_date >= ($1 - '30 days'::interval)))"
"    ->  GroupAggregate  (cost=0.00..9.21 rows=1 width=11)"
"          ->  Index Scan using bm_us_bids_item_ix on bm_us_bids b
(cost=0.00..9.20 rows=1 width=11)"
"                Index Cond: ((item_id = $0) AND (bid_date < $1) AND
(bid_date >= ($1 - '14 days'::interval)))"
"    ->  GroupAggregate  (cost=0.00..9.21 rows=1 width=11)"
"          ->  Index Scan using bm_us_bids_item_ix on bm_us_bids b
(cost=0.00..9.20 rows=1 width=11)"
"                Index Cond: ((item_id = $0) AND (bid_date < $1) AND
(bid_date >= ($1 - '7 days'::interval)))"
"    ->  GroupAggregate  (cost=0.00..9.21 rows=1 width=11)"
"          ->  Index Scan using bm_us_bids_item_ix on bm_us_bids b
(cost=0.00..9.20 rows=1 width=11)"
"                Index Cond: ((item_id = $0) AND (bid_date < $1) AND
(bid_date >= ($1 - '3 days'::interval)))"

The index bm_us_bids_item_ix is on columns item_id, bidder_id, bid_date


QUESTION: Why the planner choose seq scan in the first case & indexes scan
in the second case? In a more general way, I observed that the planner has
difficulties to select index scans & does in almost all the cases seq scan,
when doing join queries. After investigations, it looks like when you join
table a with table b on a column x and y and you have an index on column x
only, the planner is not able to choose the index scan. You have to build
the index corresponding exactly to the join statement btw the 2 tables

For example,by creating an new index on item_id and bid_date, the planner
has been able to choose this last index in both cases. Would it be possible
that the planner can choose in any case the closest index for queries having
outer join

Last thing, I am running Postgres 8.3.4 on a Windows laptop having 3.5Gb
RAM, 161Gb disk and dual core 2.5Gz processor

Regards,
Julien Theulier

Вложения

В списке pgsql-performance по дате сообщения:

От: "Vladimir Sitnikov"
Дата:
Сообщение: Re: Increasing select max(datecol) from bilkaib where datecol<=date'2008-11-01' and (cr='00' or db='00') speed
От: hubert depesz lubaczewski
Дата:
Сообщение: Re: Increasing select max(datecol) from bilkaib where datecol<=date'2008-11-01' and (cr='00' or db='00') speed