Обсуждение: Single column vs composite partial index

Поиск
Список
Период
Сортировка

Single column vs composite partial index

От
Nagaraj Raj
Дата:
Hi,

I'm running one query, and I created two types of index one is composite and the other one with single column one and query planner showing almost the same cost for both index bitmap scan, I'm not sure which is appropriate to keep in production tables.


explain analyze
SELECT BAN, SUBSCRIBER_NO, ACTV_CODE, ACTV_RSN_CODE, EFFECTIVE_DATE, TRX_SEQ_NO, LOAD_DTTM, rnk AS RNK  FROM (
SELECT CT.BAN, CT.SUBSCRIBER_NO, CT.ACTV_CODE, CT.ACTV_RSN_CODE, CT.EFFECTIVE_DATE, CT.TRX_SEQ_NO, CT.LOAD_DTTM,
row_number() over (partition by CT.BAN, CT.SUBSCRIBER_NO, CT.ACTV_CODE, CT.ACTV_RSN_CODE order by CT.TRX_SEQ_NO DESC, CT.LOAD_DTTM DESC) rnk
FROM SAM_T.L_CSM_TRANSACTIONS CT WHERE CT.ACTV_CODE in ( 'NAC', 'CAN', 'RSP', 'RCL') AND LOAD_DTTM::DATE >= CURRENT_DATE - 7
) S WHERE RNK = 1
1st Index with single column:

CREATE INDEX l_csm_transactions_load_dttm_idx1
    ON sam_t.l_csm_transactions USING btree
    (load_dttm ASC NULLS LAST)


/*"Subquery Scan on s  (cost=32454.79..33555.15 rows=129 width=61) (actual time=56.473..56.473 rows=0 loops=1)"
"  Filter: (s.rnk = 1)"
"  ->  WindowAgg  (cost=32454.79..33231.52 rows=25891 width=61) (actual time=56.472..56.472 rows=0 loops=1)"
"        ->  Sort  (cost=32454.79..32519.51 rows=25891 width=53) (actual time=56.470..56.470 rows=0 loops=1)"
"              Sort Key: ct.ban, ct.subscriber_no, ct.actv_code, ct.actv_rsn_code, ct.trx_seq_no DESC, ct.load_dttm DESC"
"              Sort Method: quicksort  Memory: 25kB"
"              ->  Bitmap Heap Scan on l_csm_transactions ct  (cost=1271.13..30556.96 rows=25891 width=53) (actual time=56.462..56.462 rows=0 loops=1)"
"                    Recheck Cond: ((actv_code)::text = ANY ('{NAC,CAN,RSP,RCL}'::text[]))"
"                    Filter: ((load_dttm)::date >= (CURRENT_DATE - 7))"
"                    Rows Removed by Filter: 79137"
"                    Heap Blocks: exact=23976"
"                    ->  Bitmap Index Scan on l_csm_transactions_actv_code_idx1  (cost=0.00..1264.66 rows=77673 width=0) (actual time=6.002..6.002 rows=79137 loops=1)"
"Planning Time: 0.270 ms"
"Execution Time: 56.639 ms"*/

2nd one with composite and partial index:

CREATE INDEX l_csm_transactions_actv_code_load_dttm_idx1
    ON sam_t.l_csm_transactions USING btree
    (actv_code COLLATE pg_catalog."default" ASC NULLS LAST, (load_dttm::date) DESC NULLS FIRST)
    WHERE actv_code::text = ANY (ARRAY['NAC'::character varying, 'CAN'::character varying, 'RSP'::character varying, 'RCL'::character varying]::text[]);


/*"Subquery Scan on s  (cost=32023.15..33123.52 rows=129 width=61) (actual time=2.256..2.256 rows=0 loops=1)"
"  Filter: (s.rnk = 1)"
"  ->  WindowAgg  (cost=32023.15..32799.88 rows=25891 width=61) (actual time=2.255..2.255 rows=0 loops=1)"
"        ->  Sort  (cost=32023.15..32087.88 rows=25891 width=53) (actual time=2.254..2.254 rows=0 loops=1)"
"              Sort Key: ct.ban, ct.subscriber_no, ct.actv_code, ct.actv_rsn_code, ct.trx_seq_no DESC, ct.load_dttm DESC"
"              Sort Method: quicksort  Memory: 25kB"
"              ->  Bitmap Heap Scan on l_csm_transactions ct  (cost=1449.32..30125.32 rows=25891 width=53) (actual time=2.247..2.247 rows=0 loops=1)"
"                    Recheck Cond: (((load_dttm)::date >= (CURRENT_DATE - 7)) AND ((actv_code)::text = ANY ('{NAC,CAN,RSP,RCL}'::text[])))"
"                    ->  Bitmap Index Scan on l_csm_transactions_actv_code_load_dttm_idx1  (cost=0.00..1442.85 rows=25891 width=0) (actual time=2.244..2.245 rows=0 loops=1)"
"                          Index Cond: ((load_dttm)::date >= (CURRENT_DATE - 7))"
"Planning Time: 0.438 ms"
"Execution Time: 2.303 ms"*/



Please suggest me the best choice.

Appritiated the responce. 


Thanks,
Rj


Re: Single column vs composite partial index

От
Justin Pryzby
Дата:
On Tue, Sep 15, 2020 at 10:33:24PM +0000, Nagaraj Raj wrote:
> Hi,
> I'm running one query, and I created two types of index one is composite and the other one with single column one and
queryplanner showing almost the same cost for both index bitmap scan, I'm not sure which is appropriate to keep in
productiontables.
 

You're asking whether to keep one index or the other ?
It depends on *all* the queries you'll run, not just this one.
The most general thing to do would be to make multiple, single column indexes,
and let the planner figure out which is best (it might bitmap-AND or -OR them
together).

However, for this query, you can see the 2nd query is actually faster (2ms vs
56ms) - the cost is an estimate based on a model.

The actual performance might change based on thing like maintenance like
reindex, cluster, vacuum, hardware, and DB state (like cached blocks).
And postgres version.

The rowcount estimates are bad.  Maybe you need to ANALYZE the table (or adjust
the autoanalyze thresholds), or evaluate if there's a correlation between
columns.  Bad rowcount estimates beget bad plans and poor performance.

Also: you could use explain(ANALYZE,BUFFERS).
I think the fast plan would be possible with a tiny BRIN index on load_dttm.
(Possibly combined indexes on actv_code or others).
If you also have a btree index on time, then you can CLUSTER on it (and
analyze) and it might improve that plan further (but would affect other
queries, too).

> explain analyze SELECT BAN, SUBSCRIBER_NO, ACTV_CODE, ACTV_RSN_CODE, EFFECTIVE_DATE, TRX_SEQ_NO, LOAD_DTTM, rnk AS
RNK FROM ( SELECT CT.BAN, CT.SUBSCRIBER_NO, CT.ACTV_CODE, CT.ACTV_RSN_CODE, CT.EFFECTIVE_DATE, CT.TRX_SEQ_NO,
CT.LOAD_DTTM,row_number() over (partition by CT.BAN, CT.SUBSCRIBER_NO, CT.ACTV_CODE, CT.ACTV_RSN_CODE order by
CT.TRX_SEQ_NODESC, CT.LOAD_DTTM DESC) rnk FROM SAM_T.L_CSM_TRANSACTIONS CT WHERE CT.ACTV_CODE in ( 'NAC', 'CAN', 'RSP',
'RCL')AND LOAD_DTTM::DATE >= CURRENT_DATE - 7 ) S WHERE RNK = 1
 

> 1st Index with single column: 
> CREATE INDEX l_csm_transactions_load_dttm_idx1    ON sam_t.l_csm_transactions USING btree    (load_dttm ASC NULLS
LAST)

>  /*"Subquery Scan on s  (cost=32454.79..33555.15 rows=129 width=61) (actual time=56.473..56.473 rows=0 loops=1)
>    Filter: (s.rnk = 1)
>    ->  WindowAgg  (cost=32454.79..33231.52 rows=25891 width=61) (actual time=56.472..56.472 rows=0 loops=1)
>          ->  Sort  (cost=32454.79..32519.51 rows=25891 width=53) (actual time=56.470..56.470 rows=0 loops=1)
>                Sort Key: ct.ban, ct.subscriber_no, ct.actv_code, ct.actv_rsn_code, ct.trx_seq_no DESC, ct.load_dttm
DESC
>                Sort Method: quicksort  Memory: 25kB
>                ->  Bitmap Heap Scan on l_csm_transactions ct  (cost=1271.13..30556.96 rows=25891 width=53) (actual
time=56.462..56.462rows=0 loops=1)
 
>                      Recheck Cond: ((actv_code)::text = ANY ('{NAC,CAN,RSP,RCL}'::text[]))
>                      Filter: ((load_dttm)::date >= (CURRENT_DATE - 7))
>                      Rows Removed by Filter: 79137
>                      Heap Blocks: exact=23976
>                      ->  Bitmap Index Scan on l_csm_transactions_actv_code_idx1  (cost=0.00..1264.66 rows=77673
width=0)(actual time=6.002..6.002 rows=79137 loops=1)
 
>  Planning Time: 0.270 ms
>  Execution Time: 56.639 ms"*/

> 2nd one with composite and partial index:
> CREATE INDEX l_csm_transactions_actv_code_load_dttm_idx1    ON sam_t.l_csm_transactions USING btree    (actv_code
COLLATEpg_catalog."default" ASC NULLS LAST, (load_dttm::date) DESC NULLS FIRST)    WHERE actv_code::text = ANY
(ARRAY['NAC'::charactervarying, 'CAN'::character varying, 'RSP'::character varying, 'RCL'::character
varying]::text[]);
> 
> /*"Subquery Scan on s  (cost=32023.15..33123.52 rows=129 width=61) (actual time=2.256..2.256 rows=0 loops=1)
>    Filter: (s.rnk = 1)
>    ->  WindowAgg  (cost=32023.15..32799.88 rows=25891 width=61) (actual time=2.255..2.255 rows=0 loops=1)
>          ->  Sort  (cost=32023.15..32087.88 rows=25891 width=53) (actual time=2.254..2.254 rows=0 loops=1)
>                Sort Key: ct.ban, ct.subscriber_no, ct.actv_code, ct.actv_rsn_code, ct.trx_seq_no DESC, ct.load_dttm
DESC
>                Sort Method: quicksort  Memory: 25kB
>                ->  Bitmap Heap Scan on l_csm_transactions ct  (cost=1449.32..30125.32 rows=25891 width=53) (actual
time=2.247..2.247rows=0 loops=1)
 
>                      Recheck Cond: (((load_dttm)::date >= (CURRENT_DATE - 7)) AND ((actv_code)::text = ANY
('{NAC,CAN,RSP,RCL}'::text[])))
>                      ->  Bitmap Index Scan on l_csm_transactions_actv_code_load_dttm_idx1  (cost=0.00..1442.85
rows=25891width=0) (actual time=2.244..2.245 rows=0 loops=1)
 
>                            Index Cond: ((load_dttm)::date >= (CURRENT_DATE - 7))
>  Planning Time: 0.438 ms
>  Execution Time: 2.303 ms"*/



Re: Single column vs composite partial index

От
Nagaraj Raj
Дата:
You're asking whether to keep one index or the other?

My ask is which index can be used for the mentioned query in production for better IO

It depends on *all* the queries you'll run, not just this one.

I'm more concerned about this specific query, this has been using in one block stored procedure, so it will be run more often on the table. 

explain(ANALYZE, BUFFERS) output: 

"Subquery Scan on s  (cost=32023.15..33123.52 rows=129 width=61) (actual time=2.615..2.615 rows=0 loops=1)"
"  Filter: (s.rnk = 1)"
"  Buffers: shared hit=218"
"  ->  WindowAgg  (cost=32023.15..32799.88 rows=25891 width=61) (actual time=2.614..2.615 rows=0 loops=1)"
"        Buffers: shared hit=218"
"        ->  Sort  (cost=32023.15..32087.88 rows=25891 width=53) (actual time=2.613..2.613 rows=0 loops=1)"
"              Sort Key: ct.ban, ct.subscriber_no, ct.actv_code, ct.actv_rsn_code, ct.trx_seq_no DESC, ct.load_dttm DESC"
"              Sort Method: quicksort  Memory: 25kB"
"              Buffers: shared hit=218"
"              ->  Bitmap Heap Scan on l_csm_transactions ct  (cost=1449.32..30125.32 rows=25891 width=53) (actual time=2.605..2.605 rows=0 loops=1)"
"                    Recheck Cond: (((load_dttm)::date >= (CURRENT_DATE - 7)) AND ((actv_code)::text = ANY ('{NAC,CAN,RSP,RCL}'::text[])))"
"                    Buffers: shared hit=218"
"                    ->  Bitmap Index Scan on l_csm_transactions_actv_code_load_dttm_idx1  (cost=0.00..1442.85 rows=25891 width=0) (actual time=2.602..2.602 rows=0 loops=1)"
"                          Index Cond: ((load_dttm)::date >= (CURRENT_DATE - 7))"
"                          Buffers: shared hit=218"
"Planning Time: 0.374 ms"
"Execution Time: 2.661 ms"



>The actual performance might change based on thing like maintenance like
>reindex, cluster, vacuum, hardware, and DB state (like cached blocks).

Note: Stats are up to date

> And Postgres version.

PostgreSQL 11.7 running on RedHat 


Thanks,
Rj

On Tuesday, September 15, 2020, 09:18:55 PM PDT, Justin Pryzby <pryzby@telsasoft.com> wrote:


On Tue, Sep 15, 2020 at 10:33:24PM +0000, Nagaraj Raj wrote:
> Hi,
> I'm running one query, and I created two types of index one is composite and the other one with single column one and query planner showing almost the same cost for both index bitmap scan, I'm not sure which is appropriate to keep in production tables.

You're asking whether to keep one index or the other ?
It depends on *all* the queries you'll run, not just this one.
The most general thing to do would be to make multiple, single column indexes,
and let the planner figure out which is best (it might bitmap-AND or -OR them
together).

However, for this query, you can see the 2nd query is actually faster (2ms vs
56ms) - the cost is an estimate based on a model.

The actual performance might change based on thing like maintenance like
reindex, cluster, vacuum, hardware, and DB state (like cached blocks).
And postgres version.

The rowcount estimates are bad.  Maybe you need to ANALYZE the table (or adjust
the autoanalyze thresholds), or evaluate if there's a correlation between
columns.  Bad rowcount estimates beget bad plans and poor performance.

Also: you could use explain(ANALYZE,BUFFERS).
I think the fast plan would be possible with a tiny BRIN index on load_dttm.
(Possibly combined indexes on actv_code or others).
If you also have a btree index on time, then you can CLUSTER on it (and
analyze) and it might improve that plan further (but would affect other
queries, too).


> explain analyze SELECT BAN, SUBSCRIBER_NO, ACTV_CODE, ACTV_RSN_CODE, EFFECTIVE_DATE, TRX_SEQ_NO, LOAD_DTTM, rnk AS RNK  FROM ( SELECT CT.BAN, CT.SUBSCRIBER_NO, CT.ACTV_CODE, CT.ACTV_RSN_CODE, CT.EFFECTIVE_DATE, CT.TRX_SEQ_NO, CT.LOAD_DTTM, row_number() over (partition by CT.BAN, CT.SUBSCRIBER_NO, CT.ACTV_CODE, CT.ACTV_RSN_CODE order by CT.TRX_SEQ_NO DESC, CT.LOAD_DTTM DESC) rnk FROM SAM_T.L_CSM_TRANSACTIONS CT WHERE CT.ACTV_CODE in ( 'NAC', 'CAN', 'RSP', 'RCL') AND LOAD_DTTM::DATE >= CURRENT_DATE - 7 ) S WHERE RNK = 1

> 1st Index with single column:
> CREATE INDEX l_csm_transactions_load_dttm_idx1    ON sam_t.l_csm_transactions USING btree    (load_dttm ASC NULLS LAST)

>  /*"Subquery Scan on s  (cost=32454.79..33555.15 rows=129 width=61) (actual time=56.473..56.473 rows=0 loops=1)
>    Filter: (s.rnk = 1)
>    ->  WindowAgg  (cost=32454.79..33231.52 rows=25891 width=61) (actual time=56.472..56.472 rows=0 loops=1)
>          ->  Sort  (cost=32454.79..32519.51 rows=25891 width=53) (actual time=56.470..56.470 rows=0 loops=1)
>                Sort Key: ct.ban, ct.subscriber_no, ct.actv_code, ct.actv_rsn_code, ct.trx_seq_no DESC, ct.load_dttm DESC
>                Sort Method: quicksort  Memory: 25kB
>                ->  Bitmap Heap Scan on l_csm_transactions ct  (cost=1271.13..30556.96 rows=25891 width=53) (actual time=56.462..56.462 rows=0 loops=1)
>                      Recheck Cond: ((actv_code)::text = ANY ('{NAC,CAN,RSP,RCL}'::text[]))
>                      Filter: ((load_dttm)::date >= (CURRENT_DATE - 7))
>                      Rows Removed by Filter: 79137
>                      Heap Blocks: exact=23976
>                      ->  Bitmap Index Scan on l_csm_transactions_actv_code_idx1  (cost=0.00..1264.66 rows=77673 width=0) (actual time=6.002..6.002 rows=79137 loops=1)
>  Planning Time: 0.270 ms
>  Execution Time: 56.639 ms"*/

> 2nd one with composite and partial index:
> CREATE INDEX l_csm_transactions_actv_code_load_dttm_idx1    ON sam_t.l_csm_transactions USING btree    (actv_code COLLATE pg_catalog."default" ASC NULLS LAST, (load_dttm::date) DESC NULLS FIRST)    WHERE actv_code::text = ANY (ARRAY['NAC'::character varying, 'CAN'::character varying, 'RSP'::character varying, 'RCL'::character varying]::text[]);
>
> /*"Subquery Scan on s  (cost=32023.15..33123.52 rows=129 width=61) (actual time=2.256..2.256 rows=0 loops=1)
>    Filter: (s.rnk = 1)
>    ->  WindowAgg  (cost=32023.15..32799.88 rows=25891 width=61) (actual time=2.255..2.255 rows=0 loops=1)
>          ->  Sort  (cost=32023.15..32087.88 rows=25891 width=53) (actual time=2.254..2.254 rows=0 loops=1)
>                Sort Key: ct.ban, ct.subscriber_no, ct.actv_code, ct.actv_rsn_code, ct.trx_seq_no DESC, ct.load_dttm DESC
>                Sort Method: quicksort  Memory: 25kB
>                ->  Bitmap Heap Scan on l_csm_transactions ct  (cost=1449.32..30125.32 rows=25891 width=53) (actual time=2.247..2.247 rows=0 loops=1)
>                      Recheck Cond: (((load_dttm)::date >= (CURRENT_DATE - 7)) AND ((actv_code)::text = ANY ('{NAC,CAN,RSP,RCL}'::text[])))
>                      ->  Bitmap Index Scan on l_csm_transactions_actv_code_load_dttm_idx1  (cost=0.00..1442.85 rows=25891 width=0) (actual time=2.244..2.245 rows=0 loops=1)
>                            Index Cond: ((load_dttm)::date >= (CURRENT_DATE - 7))
>  Planning Time: 0.438 ms
>  Execution Time: 2.303 ms"*/


Re: Single column vs composite partial index

От
Michael Lewis
Дата:
Index Cond: ((load_dttm)::date >= (CURRENT_DATE - 7))"

There is no need to cast the load_dttm field to a date in the query. The plain index on the field would be usable if you skipped that. In your example, you show creating the single column index but it isn't getting used because of the type cast. The second index is both partial, and multi-column. If your data statistics show that ((actv_code)::text = ANY ('{NAC,CAN,RSP,RCL}'::text[])) only 1% of the time, then it would certainly be helpful to have a partial index if those are the rows you want to find often and do so quickly. If the rows with those values for actv_code is more like 75% of the total rows, then there'd be no reason to make it partial IMO.

If you are often/constantly querying for only the last 7-7.999 days of data based on load_dttm, I would put that as the first column of the index since then you would be scanning a contiguous part rather than scanning 3 different parts of the composite index where actv_code = each of those three values, and then finding the rows that are recent based on the timestamp(tz?) field.