Re: Partial index plan/cardinality costing

Поиск
Список
Период
Сортировка
От Justin Pryzby
Тема Re: Partial index plan/cardinality costing
Дата
Msg-id 20181009004944.GO871@telsasoft.com
обсуждение исходный текст
Ответ на Partial index plan/cardinality costing  (James Coleman <jtc331@gmail.com>)
Список pgsql-performance
Please don't cross-post to lists.

>insert into s(status, action_at, m_fk)
>select
>  ( CASE WHEN series.n % 100 < 80 THEN
>      (ARRAY['E', 'C'])[(series.n % 2) + 1]
>    ELSE
>      (ARRAY['P', 'PD', 'A'])[((random() * 3)::integer % 3) + 1]
>    END
>  ),
>  (
>    CASE WHEN series.n % 100 < 80 THEN
>      '2018-09-07'::date + ((series.n % 365 - 365)::text || ' day')::interval
>    ELSE
>      '2018-09-07'::date + (((random() * 30)::integer % 30 - 4)::text || ' day')::interval
>    END
>  ),
>  (select m.pk from m limit 1)
>from generate_series(1, 500000) series(n);

> I have two partial indexes:
> - s_pk_action_at on s(pk, action_at) where status in ('P', 'PD', 'A')
> - s_action_at_pk on s(action_at, pk) where status in ('P', 'PD', 'A')

> - How is s_pk_action_at ever efficient to scan? Given that the highest
> cardinality (primary key) column is first, wouldn't an index scan
> effectively have to scan the entire index?

The index probably IS inefficient to scan (you could see that if you force an
bitmap index scan on s_pk_action_at)...but because of leading pkey column, the
HEAP is read sequentially, and the planner knows that the heap will be read in
order of its leading column.  Reading the entire index is less expensive than
reading most of the table (maybe nonsequentially).  This is the 2nd effect Jeff
Janes likes to point out: high correlation means 1) sequential reads; *and*, 2)
a smaller fraction of the table needs to be accessed to read a given number of
tuples.

> - Why does index scan on s_action_at_pk reads over 2x as many blocks as the
> bitmap heap scan with the same index?

Maybe because of heap pages accessed multiple times (not sequentially), since
correlation is small on this table loaded with "modulus"-style insertions.

pryzbyj=# SELECT attname, correlation FROM pg_stats WHERE tablename='s' ;
  attname  | correlation 
-----------+-------------
 pk        |           1
 status    |    0.340651
 action_at |  0.00224239
 m_fk      |           1

..so each index tuple is accessing a separate heap page.

If you create non-partial index and CLUSTER on action_at_idx, then:

pryzbyj=# SELECT attname, correlation FROM pg_stats WHERE tablename='s' ;
  attname  | correlation
-----------+-------------
 pk        |  0.00354867
 status    |    0.420806
                                            action_at |           1
 
 m_fk      |           1

 Nested Loop  (cost=1907.03..6780.65 rows=11038 width=8) (actual time=2.241..17.839 rows=8922 loops=1)
   Join Filter: (s.m_fk = m.pk)
   Buffers: shared hit=115 read=53
   ->  Seq Scan on m  (cost=0.00..1.01 rows=1 width=8) (actual time=0.009..0.011 rows=1 loops=1)
         Filter: (status = 'A'::text)
         Buffers: shared hit=1
   ->  Bitmap Heap Scan on s  (cost=1907.03..6641.66 rows=11038 width=16) (actual time=2.222..9.032 rows=8922 loops=1)
         Recheck Cond: ((action_at <= '2018-09-06'::date) AND (status = ANY ('{P,PD,A}'::text[])))
         Filter: (status = ANY ('{A,P}'::text[]))
         Rows Removed by Filter: 4313
         Heap Blocks: exact=114
         Buffers: shared hit=114 read=53
         ->  Bitmap Index Scan on s_action_at_pk  (cost=0.00..1904.27 rows=82647 width=0) (actual time=2.185..2.186
rows=13235loops=1)
 
               Index Cond: (action_at <= '2018-09-06'::date)
               Buffers: shared read=53

Also, I don't think it matters here, but action_at and status are correlated.
Planner would think that they're independent.

I don't think it's related to other issues, but also note the rowcount estimate is off:
         ->  Bitmap Index Scan on s_action_at_pk  (cost=0.00..1258.02 rows=82347 width=0) (actual time=1.026..1.026
rows=13402loops=1)                             
 
               Index Cond: (action_at <= '2018-09-06'::date)
                                           
 

Justin



В списке pgsql-performance по дате отправления:

Предыдущее
От: James Coleman
Дата:
Сообщение: Re: Partial index plan/cardinality costing
Следующее
От: Vladimir Ryabtsev
Дата:
Сообщение: Re: Why could different data in a table be processed with different performance?