Thread: Optimizer misses big in 10.4 with BRIN index

Optimizer misses big in 10.4 with BRIN index

From: Arcadiy Ivanov
Date:
Hi,

Before spamming the list with reproduction examples I want to make sure
the issue isn't already known.

Moving from 9.2 to 10.4 (AWS RDS, but it also reproduces on a local
laptop), we've discovered that the optimizer prefers a seq scan over a
fully analyzed, consistent BRIN index, ending up with a query that takes
4.8s with a seqscan vs 56ms when BRIN use is forced (an ~85x difference).
The dataset is millions of rows, and with extremely high probability the
rows are naturally clustered on the BRIN index column.

Anybody observed anything like that?

========================================================
schema0=# SET enable_seqscan=false;
SET
schema0=# EXPLAIN (analyze, verbose, costs, buffers) SELECT data,
count(*) OVER() AS full_count FROM schema0_lab.data_table WHERE segment
= 'pb1'
                     AND (data->>'tradeDate')::numeric >= '1531267200'
                     AND (data->>'tradeDate')::numeric <= '1531353600'
                     AND data->>'tradeStatus' = 'Replaced'
                     ORDER BY (data->>'tradeDate')::numeric
                     DESC;
QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  Sort  (cost=617851.03..617851.24 rows=84 width=1219) (actual
time=55.765..55.794 rows=611 loops=1)
    Output: data, (count(*) OVER (?)), (((data ->>
'tradeDate'::text))::numeric)
    Sort Key: (((data_table.data ->> 'tradeDate'::text))::numeric) DESC
    Sort Method: quicksort  Memory: 1256kB
    Buffers: shared hit=824
    ->  WindowAgg  (cost=1231.98..617848.34 rows=84 width=1219) (actual
time=52.688..55.068 rows=611 loops=1)
          Output: data, count(*) OVER (?), ((data ->>
'tradeDate'::text))::numeric
          Buffers: shared hit=824
          ->  Gather  (cost=1231.98..617846.66 rows=84 width=1179)
(actual time=8.247..51.804 rows=611 loops=1)
                Output: data
                Workers Planned: 2
                Workers Launched: 2
                Buffers: shared hit=824
                ->  Parallel Bitmap Heap Scan on schema0_lab.data_table 
(cost=231.98..616838.26 rows=35 width=1179) (actual time=3.850..46.704
rows=204 loops=3)
                      Output: data
                      Recheck Cond: ((((data_table.data ->>
'tradeDate'::text))::numeric >= '1531267200'::numeric) AND
(((data_table.data ->> 'tradeDate'::text))::numeric <=
'1531353600'::numeric))
                      Rows Removed by Index Recheck: 4404
                      Filter: (((data_table.segment)::text =
'pb1'::text) AND ((data_table.data ->> 'tradeStatus'::text) =
'Replaced'::text))
                      Heap Blocks: lossy=794
                      Buffers: shared hit=2334
                      Worker 0: actual time=3.572..44.145 rows=236 loops=1
                        Buffers: shared hit=749
                      Worker 1: actual time=0.326..45.184 rows=212 loops=1
                        Buffers: shared hit=761
                      ->  Bitmap Index Scan on tradedate_idx
(cost=0.00..231.96 rows=3377106 width=0) (actual time=4.500..4.500
rows=23040 loops=1)
                            Index Cond: ((((data_table.data ->>
'tradeDate'::text))::numeric >= '1531267200'::numeric) AND
(((data_table.data ->> 'tradeDate'::text))::numeric <=
'1531353600'::numeric))
                            Buffers: shared hit=30
  Planning time: 0.246 ms
  Execution time: 56.209 ms
(29 rows)

schema0=# SET enable_seqscan=true;
SET
schema0=# EXPLAIN (analyze, verbose, costs, buffers) SELECT data,
count(*) OVER() AS full_count FROM schema0_lab.data_table WHERE segment
= 'pb1'
                     AND (data->>'tradeDate')::numeric >= '1531267200'
                     AND (data->>'tradeDate')::numeric <= '1531353600'
                     AND data->>'tradeStatus' = 'Replaced'
                     ORDER BY (data->>'tradeDate')::numeric
                     DESC;
QUERY PLAN


---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  Sort  (cost=617619.05..617619.26 rows=84 width=1219) (actual
time=4823.081..4823.106 rows=611 loops=1)
    Output: data, (count(*) OVER (?)), (((data ->>
'tradeDate'::text))::numeric)
    Sort Key: (((data_table.data ->> 'tradeDate'::text))::numeric) DESC
    Sort Method: quicksort  Memory: 1256kB
    Buffers: shared hit=839 read=187353
    ->  WindowAgg  (cost=1000.00..617616.36 rows=84 width=1219) (actual
time=4820.005..4822.390 rows=611 loops=1)
          Output: data, count(*) OVER (?), ((data ->>
'tradeDate'::text))::numeric
          Buffers: shared hit=839 read=187353
          ->  Gather  (cost=1000.00..617614.68 rows=84 width=1179)
(actual time=3.262..4819.362 rows=611 loops=1)
                Output: data
                Workers Planned: 2
                Workers Launched: 2
                Buffers: shared hit=839 read=187353
                ->  Parallel Seq Scan on schema0_lab.data_table
(cost=0.00..616606.28 rows=35 width=1179) (actual time=6.135..4814.826
rows=204 loops=3)
                      Output: data
                      Filter: (((data_table.segment)::text =
'pb1'::text) AND ((data_table.data ->> 'tradeStatus'::text) =
'Replaced'::text) AND (((data_table.data ->>
'tradeDate'::text))::numeric >= '1531267200'::numeric) AND
(((data_table.data
  ->> 'tradeDate'::text))::numeric <= '1531353600'::numeric))
                      Rows Removed by Filter: 1125498
                      Buffers: shared hit=2400 read=561439
                      Worker 0: actual time=11.414..4812.744 rows=220
loops=1
                        Buffers: shared hit=775 read=187333
                      Worker 1: actual time=4.249..4813.264 rows=220 loops=1
                        Buffers: shared hit=786 read=186753
  Planning time: 0.232 ms
  Execution time: 4823.412 ms
(24 rows)

schema0=# \d schema0_lab.data_table

                                          Table "schema0_lab.data_table"
    Column   |          Type          | Collation | Nullable
|                     Default
------------+------------------------+-----------+----------+--------------------------------------------------
  id         | integer                |           | not null |
nextval('schema0_lab.data_table_id_seq'::regclass)
  address    | character varying(128) |           | not null |
  segment    | character varying(128) |           | not null |
  data       | jsonb                  |           | not null |
Indexes:
     "data_table_pkey" PRIMARY KEY, btree (id)
     "tradedate_idx" brin (((data ->> 'tradeDate'::text)::numeric)) WITH
(autosummarize='true')

--
Arcadiy Ivanov
arcadiy@gmail.com | @arcivanov | https://ivanov.biz
https://github.com/arcivanov


Re: Optimizer misses big in 10.4 with BRIN index

From: Tomas Vondra
Date:
Hi,

On 07/25/2018 03:58 PM, Arcadiy Ivanov wrote:
>                       ->  Bitmap Index Scan on tradedate_idx 
> (cost=0.00..231.96 rows=3377106 width=0) (actual time=4.500..4.500 
> rows=23040 loops=1)
>                             Index Cond: ((((data_table.data ->> 
> 'tradeDate'::text))::numeric >= '1531267200'::numeric) AND 
> (((data_table.data ->> 'tradeDate'::text))::numeric <= 
> '1531353600'::numeric))

My guess is this is the root cause - the estimated number of rows is
much higher than the actual one (3377106 vs. 23040), so in the end the
seqscan is considered slightly cheaper and wins. But the actual row
count is ~150x lower, making the bitmap index scan way faster.

IMHO you'll need to find a way to improve the estimates, which may be 
difficult. The first thing I'd try is creating an expression index on 
the expression you use in the WHERE clause. Something like

     CREATE INDEX ON data_table (((data ->> 'tradeDate'::text)::numeric));

And then ANALYZE the table again ...
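
Once that index exists and the table has been re-analyzed, the
expression's statistics can be checked in pg_stats, where they show up
under the index's name rather than the table's. A sketch (the index
name below is a placeholder for whatever name CREATE INDEX generates):

     -- 'data_table_numeric_idx' is a placeholder for the generated
     -- index name; check \d schema0_lab.data_table for the real one.
     SELECT tablename, attname, n_distinct, correlation
       FROM pg_stats
      WHERE tablename = 'data_table_numeric_idx';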

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Optimizer misses big in 10.4 with BRIN index

From: David Rowley
Date:
On 26 July 2018 at 04:50, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> My guess is this is the root cause - the estimated number of rows is much
> higher than the actual one (3377106 vs. 23040), so in the end the seqscan
> is considered slightly cheaper and wins. But the actual row count is
> ~150x lower, making the bitmap index scan way faster.

Isn't the 23040 just the totalpages * 10 per `return totalpages * 10;`
in bringetbitmap()?

The BRIN index costing was changed quite dramatically in [1], which
first appeared in pg10. Prior to that patch, BRIN assumed it would
touch total_pages * pred_selectivity blocks in the bitmap scan.  That
worked well when the table happened to be perfectly ordered by the
column of the BRIN index, but was pretty broken when the order was
random.  That led to BRIN indexes being used when they really
shouldn't have been.  The patch tried to fix that by using what
information it could, which was basically the correlation statistics
from pg_statistic.  It appeared that the patch was producing more
realistic plans than the unpatched code, so we went with it, but it's
definitely not perfect; what statistics are?

It would be quite interesting to know the result of:

select stakind3 from pg_statistic where starelid = 'schema0_lab.data_table'::regclass;
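
Relatedly, the correlation statistics that the new BRIN costing leans
on can be read in a friendlier form from pg_stats. A sketch (the index
name is included in case ANALYZE has stored statistics for the indexed
expression under it):

SELECT tablename, attname, n_distinct, correlation
  FROM pg_stats
 WHERE schemaname = 'schema0_lab'
   AND tablename IN ('data_table', 'tradedate_idx');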

I also see that the estimated costs of the two plans are very close, so
the chosen plan may well be quite susceptible to change across
different ANALYZE runs on the table.

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=7e534adcdc70866e7be74d626b0ed067c890a251

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Optimizer misses big in 10.4 with BRIN index

From: Emre Hasegeli
Date:
> Isn't the 23040 just the totalpages * 10 per `return totalpages * 10;`
> in bringetbitmap()?

Yes, it is just confusing.  The correct value is one level up in the
tree.  It is 204 rows + 4404 rows removed by index recheck = 4608, so
the estimate is not just 150x but 733x off :(.

The sequential scan plan shows 204 + 1125498 rows removed by filter =
1125702 as the actual table size.  However, the former plan estimates
getting 3377106 rows from the index.  That is 3x the table size.  The
selectivity estimate cannot be greater than 1.  If I am not missing
anything, the general statistics of this table must be seriously
outdated.
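
A quick way to check that is to compare the planner's idea of the table
size with reality (a sketch; reltuples and relpages are what the row
estimates are derived from):

SELECT relname, reltuples, relpages
  FROM pg_class
 WHERE oid = 'schema0_lab.data_table'::regclass;

SELECT count(*) FROM schema0_lab.data_table;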


Re: Optimizer misses big in 10.4 with BRIN index

From: Tomas Vondra
Date:
On 07/26/2018 10:11 AM, Emre Hasegeli wrote:
>> Isn't the 23040 just the totalpages * 10 per `return totalpages * 10;`
>> in bringetbitmap()?
> 
> Yes, it is just confusing.  The correct value is one level up in the
> tree.  It is 204 rows + 4404 rows removed by index recheck = 4608, so
> the estimate is not just 150x but 733x off :(.
> 
> The sequential scan plan shows 204 + 1125498 rows removed by filter =
> 1125702 as the actual table size.  However, the former plan estimates
> getting 3377106 rows from the index.  That is 3x the table size.  The
> selectivity estimate cannot be greater than 1.  If I am not missing
> anything, the general statistics of this table must be seriously
> outdated.
> 

Hmmm, yeah. It's a bit confusing, and the parallel plan does not improve
the situation either :-(

Arcadiy, can you provide plans with parallel query disabled? Or even 
better, produce a test case that reproduces this (using synthetic data, 
anonymized data or something like that, if needed).
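
For the non-parallel plan, something like this should do it (a sketch;
the setting is session-local):

SET max_parallel_workers_per_gather = 0;  -- planner won't consider parallel plans

EXPLAIN (analyze, verbose, costs, buffers)
SELECT data, count(*) OVER () AS full_count
  FROM schema0_lab.data_table
 WHERE segment = 'pb1'
   AND (data->>'tradeDate')::numeric >= '1531267200'
   AND (data->>'tradeDate')::numeric <= '1531353600'
   AND data->>'tradeStatus' = 'Replaced'
 ORDER BY (data->>'tradeDate')::numeric DESC;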


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Optimizer misses big in 10.4 with BRIN index

From: Arcadiy Ivanov
Date:
On 07/26/2018 07:27 AM, Tomas Vondra wrote:
> Arcadiy, can you provide plans with parallel query disabled? Or even
> better, produce a test case that reproduces this (using synthetic
> data, anonymized data or something like that, if needed).
So I basically spent most of the time trying to create a reproducible
case, and I have to say I failed. I can, however, reproduce this with a
specific large data set with a specific data distribution, but not with
an artificial one. Unfortunately the data is restricted, so I have to
obfuscate the data structures. That said, I am able to reproduce this
sporadically on a local machine after exporting from RDS 10.4. On some
days it reproduces every time; on others it's hard to reproduce at all.
So here it goes, in sequential order:



** Data is imported into the database
======================
db0=# \d+ schema0.table0
                                          Table "schema0.table0"
    Column   |          Type          | Collation | Nullable | Default |
Storage  | Stats target | Description
------------+------------------------+-----------+----------+---------+----------+--------------+-------------
  field1     | character varying(128) |           | not null |         |
extended |              |
  field2     | character varying(128) |           | not null |         |
extended |              |
  field3     | character varying(128) |           | not null |         |
extended |              |
  field4     | bigint                 |           | not null |         |
plain    |              |
  field5     | bigint                 |           | not null |         |
plain    |              |
  field6     | bigint                 |           | not null |         |
plain    |              |
  data       | jsonb                  |           | not null |         |
extended |              |
Indexes:
     "date_idx" brin (((data ->> 'date'::text)::numeric)) WITH
(autosummarize='true')
Options: autovacuum_vacuum_scale_factor=0.0,
autovacuum_vacuum_threshold=5000, autovacuum_analyze_scale_factor=0.0,
autovacuum_analyze_threshold=5000, fillfactor=75


** Data distribution in the table
======================
db0# SELECT (data->>'date')::numeric as date, count(1) FROM
schema0.table0 GROUP BY date ORDER BY date;
     date    |  count
------------+---------
  1527580800 |       5
  1527753600 |      72
  1527840000 |      36
  1528012800 |      18
  1528099200 |      72
  1528185600 |    7266
  1528272000 |     336
  1528358400 |     230
  1528444800 |      90
  1528704000 |      76
  1528790400 |       4
  1528876800 |    9060
  1528934400 |       6
  1528963200 |      33
  1529193600 |       6
  1529280000 |      18
  1529294400 |  541344
  1529380800 | 1113121
  1529467200 |  794082
  1529553600 |  604752
  1529812800 |  135252
  1529899200 |   63222
  1529985600 |   31134
  1530072000 |   25392
  1530158400 |      90
  1530417600 |      48
  1530504000 |     144
  1530518400 |       1
  1530590400 |    1958
  1530604800 |       1
  1530763200 |   48614
  1530849600 |       9
  1531022400 |       3
  1531108800 |       3
  1531195200 |       1
*** This is the range we're interested in
  1531281600 |    3713
*******
  1531627200 |     300
  1531713600 |    1113
  1531800000 |      30
  1531886400 |   36000
  1532232000 |   24900
  1532577600 |  409364
  1532664000 |    2050
  1532836800 |  334782
  1532923200 |    6771
(45 rows)

** No ANALYZE has been run after the import. Let's see how we do.
=======================
db0=# EXPLAIN (analyze, verbose, costs, buffers) SELECT ctid,
(data->>'date')::numeric as date FROM schema0.table0 WHERE
(data->>'date')::numeric >= '1531267200'
                     AND (data->>'date')::numeric <= '1531353600'
                     ORDER BY ctid, (data->>'date')::numeric
                     DESC LIMIT 1000;
QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  Limit  (cost=1102962.86..1103079.54 rows=1000 width=38) (actual
time=7234.091..7234.697 rows=1000 loops=1)
    Output: ctid, (((data ->> 'date'::text))::numeric)
    Buffers: shared hit=9 read=350293
    ->  Gather Merge  (cost=1102962.86..1105002.57 rows=17482 width=38)
(actual time=7234.089..7234.589 rows=1000 loops=1)
          Output: ctid, (((data ->> 'date'::text))::numeric)
          Workers Planned: 2
          Workers Launched: 2
          Buffers: shared hit=9 read=350293
          ->  Sort  (cost=1101962.84..1101984.69 rows=8741 width=38)
(actual time=7227.938..7227.985 rows=964 loops=3)
                Output: ctid, (((data ->> 'date'::text))::numeric)
                Sort Key: table0.ctid, (((table0.data ->>
'date'::text))::numeric) DESC
                Sort Method: quicksort  Memory: 102kB
                Buffers: shared hit=144 read=1048851
                Worker 0: actual time=7223.770..7223.820 rows=1272 loops=1
                  Buffers: shared hit=68 read=348293
                Worker 1: actual time=7226.520..7226.589 rows=1299 loops=1
                  Buffers: shared hit=67 read=350265
                ->  Parallel Seq Scan on schema0.table0
(cost=0.00..1101390.58 rows=8741 width=38) (actual time=68.862..7227.366
rows=1238 loops=3)
                      Output: ctid, ((data ->> 'date'::text))::numeric
                      Filter: ((((table0.data ->>
'date'::text))::numeric >= '1531267200'::numeric) AND (((table0.data ->>
'date'::text))::numeric <= '1531353600'::numeric))
                      Rows Removed by Filter: 1397270
                      Buffers: shared hit=32 read=1048849
                      Worker 0: actual time=50.494..7223.107 rows=1272
loops=1
                        Buffers: shared hit=11 read=348293
                      Worker 1: actual time=53.408..7225.849 rows=1299
loops=1
                        Buffers: shared hit=12 read=350263
  Planning time: 2.096 ms
  Execution time: 7235.867 ms
(28 rows)

** OK, no statistics, so a Parallel Seq Scan is selected; that's expected.
=======================
db0=# VACUUM ANALYZE VERBOSE schema0.table0;
INFO:  vacuuming "schema0.table0"
INFO:  index "date_idx" now contains 8194 row versions in 45 pages
DETAIL:  0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO:  "table0": found 0 removable, 4195522 nonremovable row versions in
1048881 out of 1048881 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 1243
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 2.36 s, system: 4.84 s, elapsed: 7.48 s.
INFO:  vacuuming "pg_toast.pg_toast_7888880"
INFO:  index "pg_toast_7888880_index" now contains 0 row versions in 1 pages
DETAIL:  0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO:  "pg_toast_7888880": found 0 removable, 0 nonremovable row
versions in 0 out of 0 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 1243
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO:  analyzing "schema0.table0"
INFO:  "table0": scanned 30000 of 1048881 pages, containing 120000 live
rows and 0 dead rows; 30000 rows in sample, 4195524 estimated total rows
VACUUM

** Statistics are collected. Let's explain again
=======================
db0=# EXPLAIN (analyze, verbose, costs, buffers) SELECT ctid,
(data->>'date')::numeric as date FROM schema0.table0 WHERE
(data->>'date')::numeric >= '1531267200'
                     AND (data->>'date')::numeric <= '1531353600'
                     ORDER BY ctid, (data->>'date')::numeric
                     DESC LIMIT 1000;
QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  Limit  (cost=2393.45..2394.66 rows=482 width=38) (actual
time=7498.674..7498.791 rows=1000 loops=1)
    Output: ctid, (((data ->> 'date'::text))::numeric)
    Buffers: shared hit=74 read=411764
    ->  Sort  (cost=2393.45..2394.66 rows=482 width=38) (actual
time=7498.672..7498.731 rows=1000 loops=1)
          Output: ctid, (((data ->> 'date'::text))::numeric)
          Sort Key: table0.ctid, (((table0.data ->>
'date'::text))::numeric) DESC
          Sort Method: top-N heapsort  Memory: 95kB
          Buffers: shared hit=74 read=411764
          ->  Bitmap Heap Scan on schema0.table0 (cost=162.16..2371.97
rows=482 width=38) (actual time=58.322..7497.656 rows=3713 loops=1)
                Output: ctid, ((data ->> 'date'::text))::numeric
                Recheck Cond: ((((table0.data ->>
'date'::text))::numeric >= '1531267200'::numeric) AND (((table0.data ->>
'date'::text))::numeric <= '1531353600'::numeric))
                Rows Removed by Index Recheck: 1643390
                Heap Blocks: lossy=411776
                Buffers: shared hit=74 read=411764
                ->  Bitmap Index Scan on date_idx (cost=0.00..162.03
rows=557 width=0) (actual time=40.243..40.243 rows=4117760 loops=1)
                      Index Cond: ((((table0.data ->>
'date'::text))::numeric >= '1531267200'::numeric) AND (((table0.data ->>
'date'::text))::numeric <= '1531353600'::numeric))
                      Buffers: shared hit=62
  Planning time: 1.388 ms
  Execution time: 7498.907 ms
(19 rows)

** Wait... What??? "Rows Removed by Index Recheck: 1643390" but data is
almost sequential! Let's take a look at it.
=======================
db0=# SELECT ctid, (data->>'date')::numeric as date FROM schema0.table0
WHERE (data->>'date')::numeric >= '1531267200'
                     AND (data->>'date')::numeric <= '1531353600'
                     ORDER BY ctid, (data->>'date')::numeric
                     DESC LIMIT 1000;
    ctid    |    date
-----------+------------
  (4456,2)  | 1531281600
  (4468,2)  | 1531281600
  (9301,2)  | 1531281600
  (9310,3)  | 1531281600
...
  (9317,4)  | 1531281600
  (9325,3)  | 1531281600
  (9325,4)  | 1531281600
  (9326,1)  | 1531281600
  (9328,1)  | 1531281600
  (9328,2)  | 1531281600
  (9345,1)  | 1531281600
...
  (9365,2)  | 1531281600
  (9386,2)  | 1531281600
  (9386,3)  | 1531281600
  (9386,4)  | 1531281600
  (9418,4)  | 1531281600
...
  (9419,4)  | 1531281600
  (12921,2) | 1531281600
...
  (12962,2) | 1531281600
  (12983,1) | 1531281600
  (12983,2) | 1531281600
  (12983,3) | 1531281600
  (12983,4) | 1531281600
  (13028,3) | 1531281600
  (13028,4) | 1531281600
  (14113,1) | 1531281600
  (14113,2) | 1531281600
  (15297,1) | 1531281600
...
  (15317,2) | 1531281600
  (15350,1) | 1531281600
  (15350,2) | 1531281600
  (15350,3) | 1531281600
  (15350,4) | 1531281600
  (15386,3) | 1531281600
  (15386,4) | 1531281600
  (17620,3) | 1531281600
...
  (17648,4) | 1531281600
  (17652,1) | 1531281600
...
  (17653,2) | 1531281600
  (17658,1) | 1531281600
...
  (17659,2) | 1531281600
  (21229,2) | 1531281600
  (21243,1) | 1531281600
...
  (21246,4) | 1531281600
  (35794,2) | 1531281600
...
  (35798,1) | 1531281600
  (35827,4) | 1531281600
...
  (35887,2) | 1531281600
  (37006,3) | 1531281600
...
  (37080,1) | 1531281600
  (37157,2) | 1531281600
  (37157,3) | 1531281600
  (37157,4) | 1531281600
  (39475,1) | 1531281600
...
  (39552,4) | 1531281600
  (39623,4) | 1531281600
  (39653,1) | 1531281600
  (39653,2) | 1531281600
  (39653,3) | 1531281600
  (39653,4) | 1531281600
  (39674,1) | 1531281600
  (39674,2) | 1531281600
  (39674,3) | 1531281600
  (39674,4) | 1531281600
  (40829,3) | 1531281600
  (40829,4) | 1531281600
  (41946,4) | 1531281600
...
  (41962,2) | 1531281600
(1000 rows)

** Ok, fine, there are large gaps. Let's make **absolutely sure** the
data is sequential.
=======================
db0=# CREATE INDEX date_ord_idx ON schema0.table0 (((table0.data ->>
'date')::numeric));
CREATE INDEX

** Not running ANALYZE on the new index; it shouldn't matter if the stats
are wrong. Expecting a scan and sort.
=======================
db0=# CLUSTER VERBOSE schema0.table0 USING date_ord_idx;
INFO:  clustering "schema0.table0" using sequential scan and sort
INFO:  "table0": found 0 removable, 4195522 nonremovable row versions in
1048881 pages
DETAIL:  0 dead row versions cannot be removed yet.
CPU: user: 873.12 s, system: 19.18 s, elapsed: 1019.36 s.
CLUSTER

** OK, now we're clustered. Dropping the ordering b-tree index.
=======================
db0=# DROP INDEX schema0.date_ord_idx ;
DROP INDEX

** Let's explain again
=======================
db0=# EXPLAIN (analyze, verbose, costs, buffers) SELECT ctid,
(data->>'date')::numeric as date FROM schema0.table0 WHERE
(data->>'date')::numeric >= '1531267200'
                     AND (data->>'date')::numeric <= '1531353600'
                     ORDER BY ctid, (data->>'date')::numeric
                     DESC LIMIT 1000;
QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  Limit  (cost=2393.45..2394.66 rows=482 width=38) (actual
time=7339.584..7339.703 rows=1000 loops=1)
    Output: ctid, (((data ->> 'date'::text))::numeric)
    Buffers: shared hit=56 read=410752
    ->  Sort  (cost=2393.45..2394.66 rows=482 width=38) (actual
time=7339.583..7339.634 rows=1000 loops=1)
          Output: ctid, (((data ->> 'date'::text))::numeric)
          Sort Key: table0.ctid, (((table0.data ->>
'date'::text))::numeric) DESC
          Sort Method: top-N heapsort  Memory: 95kB
          Buffers: shared hit=56 read=410752
          ->  Bitmap Heap Scan on schema0.table0 (cost=162.16..2371.97
rows=482 width=38) (actual time=59.252..7338.635 rows=3713 loops=1)
                Output: ctid, ((data ->> 'date'::text))::numeric
                Recheck Cond: ((((table0.data ->>
'date'::text))::numeric >= '1531267200'::numeric) AND (((table0.data ->>
'date'::text))::numeric <= '1531353600'::numeric))
                Rows Removed by Index Recheck: 1639294
                Heap Blocks: lossy=410752
                Buffers: shared hit=56 read=410752
                ->  Bitmap Index Scan on date_idx (cost=0.00..162.03
rows=557 width=0) (actual time=24.545..24.545 rows=4107520 loops=1)
                      Index Cond: ((((table0.data ->>
'date'::text))::numeric >= '1531267200'::numeric) AND (((table0.data ->>
'date'::text))::numeric <= '1531353600'::numeric))
                      Buffers: shared hit=56
  Planning time: 0.515 ms
  Execution time: 7339.772 ms
(19 rows)

**  "Rows Removed by Index Recheck: 1639294" - OK, statistics are
pooched post-clustering. Let's reanalyze.
=======================
db0=# ANALYZE VERBOSE schema0.table0;
INFO:  analyzing "schema0.table0"
INFO:  "table0": scanned 30000 of 1048881 pages, containing 120000 live
rows and 0 dead rows; 30000 rows in sample, 4195524 estimated total rows
ANALYZE


**  Let's try this again...
=======================
db0=# EXPLAIN (analyze, verbose, costs, buffers) SELECT ctid,
(data->>'date')::numeric as date FROM schema0.table0 WHERE
(data->>'date')::numeric >= '1531267200'
                     AND (data->>'date')::numeric <= '1531353600'
                     ORDER BY ctid, (data->>'date')::numeric
                     DESC LIMIT 1000;
QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  Limit  (cost=5444.33..5445.71 rows=555 width=38) (actual
time=6533.824..6533.927 rows=1000 loops=1)
    Output: ctid, (((data ->> 'date'::text))::numeric)
    Buffers: shared hit=410808
    ->  Sort  (cost=5444.33..5445.71 rows=555 width=38) (actual
time=6533.822..6533.862 rows=1000 loops=1)
          Output: ctid, (((data ->> 'date'::text))::numeric)
          Sort Key: table0.ctid, (((table0.data ->>
'date'::text))::numeric) DESC
          Sort Method: top-N heapsort  Memory: 95kB
          Buffers: shared hit=410808
          ->  Bitmap Heap Scan on schema0.table0 (cost=162.22..5419.03
rows=555 width=38) (actual time=59.316..6532.913 rows=3713 loops=1)
                Output: ctid, ((data ->> 'date'::text))::numeric
                Recheck Cond: ((((table0.data ->>
'date'::text))::numeric >= '1531267200'::numeric) AND (((table0.data ->>
'date'::text))::numeric <= '1531353600'::numeric))
                Rows Removed by Index Recheck: 1639294
                Heap Blocks: lossy=410752
                Buffers: shared hit=410808
                ->  Bitmap Index Scan on date_idx (cost=0.00..162.08
rows=1339 width=0) (actual time=30.859..30.859 rows=4107520 loops=1)
                      Index Cond: ((((table0.data ->>
'date'::text))::numeric >= '1531267200'::numeric) AND (((table0.data ->>
'date'::text))::numeric <= '1531353600'::numeric))
                      Buffers: shared hit=56
  Planning time: 1.100 ms
  Execution time: 6534.026 ms
(19 rows)

** This cannot be possible! The table should now be clustered on the very
expression the BRIN index is built on! Let's look at the data.
=======================
db0=# SELECT ctid, (data->>'date')::numeric as date FROM schema0.table0
WHERE (data->>'date')::numeric >= '1531267200'
                     AND (data->>'date')::numeric <= '1531353600'
                     ORDER BY ctid, (data->>'date')::numeric
                     DESC LIMIT 1000;
    ctid    |    date
-----------+------------
  (4155,2)  | 1531281600
...
  (4159,1)  | 1531281600
  (4188,4)  | 1531281600
  (4189,1)  | 1531281600
  (4189,2)  | 1531281600
  (4189,3)  | 1531281600
  (4189,4)  | 1531281600
  (4231,4)  | 1531281600
  (4247,3)  | 1531281600
  (4247,4)  | 1531281600
  (4248,1)  | 1531281600
  (4248,2)  | 1531281600
  (5367,3)  | 1531281600
...
  (5368,4)  | 1531281600
  (5376,3)  | 1531281600
...
  (5441,1)  | 1531281600
  (5518,2)  | 1531281600
  (5518,3)  | 1531281600
  (5518,4)  | 1531281600
  (7836,1)  | 1531281600
  (7836,2)  | 1531281600
  (7851,3)  | 1531281600
...
  (7913,4)  | 1531281600
  (7984,4)  | 1531281600
  (8014,1)  | 1531281600
  (8014,2)  | 1531281600
  (8014,3)  | 1531281600
  (8014,4)  | 1531281600
  (8035,1)  | 1531281600
  (8035,2)  | 1531281600
  (8035,3)  | 1531281600
  (8035,4)  | 1531281600
  (9190,3)  | 1531281600
  (9190,4)  | 1531281600
  (10307,4) | 1531281600
...
  (10349,3) | 1531281600
  (10371,3) | 1531281600
...
  (10392,2) | 1531281600
  (10445,4) | 1531281600
  (10446,1) | 1531281600
  (10446,2) | 1531281600
  (11580,3) | 1531281600
...
  (11586,2) | 1531281600
  (11595,3) | 1531281600
...
  (11598,2) | 1531281600
  (11600,1) | 1531281600
...
  (11601,2) | 1531281600
  (11609,1) | 1531281600
...
  (11621,2) | 1531281600
  (11621,3) | 1531281600
  (11664,3) | 1531281600
...
  (11684,3) | 1531281600
  (11726,2) | 1531281600
...
  (11770,4) | 1531281600
  (12891,2) | 1531281600
  (14057,1) | 1531281600
...
  (14058,2) | 1531281600
  (14064,3) | 1531281600
...
  (14071,3) | 1531281600
(1000 rows)

****

The issue seems to involve both clustering and BRIN stats.

The table's fillfactor is not a factor: it can be 100 or 75, and I'm
able both to reproduce and to fail to reproduce with either.
Clustering on `date_ord_idx` can be done with or without ANALYZE; in
some cases PG picks a scan and sort, sometimes it uses the index
regardless of stats.
Sometimes the table ends up properly clustered afterwards, sometimes it
doesn't. Sometimes clustering first on the unanalyzed `date_ord_idx`
doesn't fix the problem, but then analyzing and clustering on
`date_ord_idx` again does fix it. Sometimes neither helps.
A full VACUUM ANALYZE may help, or may not, depending on the phase of
the moon.
When the BRIN index actually behaves, the EXPLAIN shows a few hundred
rows eliminated by the recheck rather than >1M, and queries for the
date range take 50-70 ms.
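
One way to see directly whether the BRIN range summaries actually
tightened up after CLUSTER is to inspect them with the pageinspect
extension (a sketch; block numbers are illustrative -- page 0 is the
metapage and the next page(s) hold the revmap, so a regular page has to
be located first):

    CREATE EXTENSION IF NOT EXISTS pageinspect;

    -- Find which index pages are 'regular' (i.e. hold range summaries).
    SELECT blkno, brin_page_type(get_raw_page('schema0.date_idx', blkno))
      FROM generate_series(0, 20) AS blkno;

    -- List per-range summaries from one such page (replace 10 with a
    -- block number reported as 'regular' above).
    SELECT itemoffset, blknum, value
      FROM brin_page_items(get_raw_page('schema0.date_idx', 10),
                           'schema0.date_idx')
     ORDER BY blknum
     LIMIT 20;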

If anybody can give a suggestion on where in the code to start looking,
I'll be extra-grateful.

--
Arcadiy Ivanov
arcadiy@gmail.com | @arcivanov | https://ivanov.biz
https://github.com/arcivanov


Re: Optimizer misses big in 10.4 with BRIN index

From: Emre Hasegeli
Date:
> So I basically spent most of the time trying to create a reproducible
> case, and I have to say I failed. I can, however, reproduce this with a
> specific large data set with a specific data distribution, but not with
> an artificial one.

The query plans you posted that have statistics prefer a Bitmap Index
Scan.  This is not a reproduction of the originally posted case.

> ** Wait... What??? "Rows Removed by Index Recheck: 1643390" but data is
> almost sequential! Let's take a look at it.

I don't think it has anything to do with query planning.  Have you
tried the "pages_per_range" option of BRIN?