I intend to understand further on PostgreSQL Index behavior on a "SELECT" statement.
We have a situation where-in Index on unique column is not being picked up as expected when used with-in the WHERE clause with other non-unique columns using AND operator.
explain SELECT tv.short_code, tv.chn as pkg_subscription_chn,
tv.vert as pkg_vert, ubs.campaign_id as campaign, 'none'::varchar as referer,
CAST('CAMPAIGNWISE_SUBSCRIBER_BASE' AS VARCHAR) as vn, count(tv.msisdn) as n_count, '0'::numeric AS tot_revenue
FROM campaign_base ubs
JOIN tab_current_day_v2 tv
ON ubs.ubs_seq_id = tv.ubs_seq_id
AND tv.dt = CAST('2012-09-08' AS DATE)
GROUP BY tv.short_code, tv.vert, tv.chn, ubs.campaign_id, vn;
QUERY PLAN
----------------------------------------------------------------------------------------------------
HashAggregate (cost=77754.57..77754.58 rows=1 width=38)
-> Nested Loop (cost=0.00..77754.56 rows=1 width=38)
-> Seq Scan on tab_current_day_v2 tv (cost=0.00..77746.26 rows=1 width=39)
Filter: (dt = '2012-09-08'::date)
-> Index Scan using cb_ubs_id_idx on campaign_base ubs (cost=0.00..8.28 rows=1 width=15)
Index Cond: (ubs.ubs_seq_id = tv.ubs_seq_id)
(6 rows)
The above plan shows "seq scan" on tab_current_day_v2 table, though there is an index on "ubs_seq_id" column which is an unique column.
Can anyone please help us understand, why PostgreSQL optimizer is not prioritizing the unique column and hitting ubs_seq_id_idx Index here ?
Later -
We have created composite Index on "dt" (one distinct value) and "ubs_seq_id" (no duplicate values) and the index has been picked up.
Below is the scenario where-in the same query's plan picking up the composite Index.
prod-db=# create index concurrently tab_dt_ubs_seq_id_idx on tab_current_day_v2(dt,ubs_seq_id);
CREATE INDEX
prod-db=# explain SELECT tv.short_code, tv.chn as pkg_subscription_chn,
tv.vert as pkg_vert, ubs.campaign_id as campaign, 'none'::varchar as referer,
CAST('CAMPAIGNWISE_SUBSCRIBER_BASE' AS VARCHAR) as vn, count(tv.msisdn) as n_count, '0'::numeric AS tot_revenue
FROM campaign_base ubs
JOIN tab_current_day_v2 tv
ON ubs.ubs_seq_id = tv.ubs_seq_id
AND tv.dt = CAST('2012-09-08' AS DATE)
GROUP BY tv.short_code, tv.vert, tv.chn, ubs.campaign_id, vn;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=16.88..16.89 rows=1 width=38)
-> Nested Loop (cost=0.00..16.86 rows=1 width=38)
-> Index Scan using tab_dt_ubs_seq_id_idx on tab_current_day_v2 tv (cost=0.00..8.57 rows=1 width=39)
Index Cond: (dt = '2012-09-08'::date)
-> Index Scan using cb_ubs_id_idx on campaign_base ubs (cost=0.00..8.28 rows=1 width=15)
Index Cond: (ubs.ubs_seq_id = tv.ubs_seq_id)
(6 rows)
I was expecting the above behavior without a composite Index. A column with most unique values must be picked up when multiple columns are used in WHERE clause using AND operator. Any thoughts ?
prod-db# \d tab_current_day_v2
Table "public.tab_current_day_v2"
Column | Type | Modifiers
--------------------------+--------------------------+-----------
dt | date |
chn | character varying(10) |
vert | character varying(20) |
isdn | character varying |
bc | character varying(40) |
status | text |
is_rene | boolean |
age_in_sys | integer |
age_in_grace | integer |
has_prof | boolean |
short_code | character varying |
sub_vert | character varying(30) |
mode | character varying |
ubs_seq_id | bigint |
pkg_name | character varying(200) |
pkg_id | integer |
subs_charge | money |
subs_time | timestamp with time zone |
ulq_seq_id | bigint |
valid_till_time | timestamp with time zone |
valid_from_time | timestamp with time zone |
latest_ube_seq_id | bigint |
latest_pkg_id | integer |
price | integer |
Indexes:
"tab_dt_ubs_seq_id_idx" btree (dt, ubs_seq_id)
"tab_isdn_idx" btree (msisdn)
"tab_status_idx" btree (status)
"ubs_seq_id_idx" btree (ubs_seq_id)
Below is the table structure and the uniqueness of each of the columns.
airtel_user_data_oltp=# select attname, n_distinct from pg_Stats where tablename='tab_current_day_v2';
attname | n_distinct
--------------------------+------------
dt | 1
chn | 7
vert | 94
isdn | -0.727331
bc | 4
status | 3
is_rene | 2
age_in_sys | 1018
age_in_grac | 369
has_prof | 2
short_code | 23
sub_vert | 5
mode | 0
ubs_seq_id | -1
pkg_name | 461
pkg_id | 461
subs_charge | 7
subs_time | -1
ulq_seq_id | 122887
valid_till_time | -0.966585
valid_from_time | -0.962563
latest_ube_seq_id | -1
latest_pkg_id | 475
price | 18
(24 rows)
This is not an issue, but, would like to understand how PostgreSQL optimizer picks up Indexes in SELECT queries.
In an other scenario, we had used 4 columns in WHERE clause with AND operator with an Index on the column with most unique values -- The Index was picked up.