[GENERAL] Performance with high correlation in group by on PK

Поиск
Список
Период
Сортировка
От Alban Hertroys
Тема [GENERAL] Performance with high correlation in group by on PK
Дата
Msg-id CAF-3MvPrZfO-vEfvrSTwEfhvgG1Ft7HFF5JoGZYa+f7wJYWskg@mail.gmail.com
обсуждение исходный текст
Ответы Re: [GENERAL] Performance with high correlation in group by on PK
Re: [GENERAL] Performance with high correlation in group by on PK
Список pgsql-general
Hi all,

It's been a while since I actually got to use PG for anything serious,
but we're finally doing some experimentation @work now to see if it is
suitable for our datawarehouse. So far it's been doing well, but there
is a particular type of query I run into that I expect we will
frequently use and that's choosing a sequential scan - and I can't
fathom why.

This is on:


The query in question is:
select "VBAK_MANDT", max("VBAK_VBELN")
  from staging.etl00001_vbak
 group by "VBAK_MANDT";

This is the header-table for another detail table, and in this case
we're already seeing a seqscan. The thing is, there are 15M rows in
the table (disk usage is 15GB), while the PK is on ("VBAK_MANDT",
"VBAK_VBELN") with very few distinct values for "VBAK_MANDT" (in fact,
we only have 1 at the moment!).

Explain analyze says the following about this query:
warehouse=# explain (analyze, buffers) select "VBAK_MANDT",
max("VBAK_VBELN") from staging.etl00001_vbak group by "VBAK_MANDT";
                                                              QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=1990054.08..1990054.09 rows=1 width=36) (actual
time=38723.602..38723.602 rows=1 loops=1)
   Group Key: "VBAK_MANDT"
   Buffers: shared hit=367490 read=1409344
   ->  Seq Scan on etl00001_vbak  (cost=0.00..1918980.72 rows=14214672
width=15) (actual time=8.886..31317.283 rows=14214672 loops=1)
         Buffers: shared hit=367490 read=1409344
 Planning time: 0.126 ms
 Execution time: 38723.674 ms
(7 rows)

As you can see, a sequential scan. The statistics seem quite correct.

If I change the query to select a single value of "VBAK_MANDT" we get:
warehouse=# explain (analyze, buffers) select max("VBAK_VBELN") from
staging.etl00001_vbak where "VBAK_MANDT" = '300';

           QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Result  (cost=1.37..1.38 rows=1 width=32) (actual time=14.911..14.911
rows=1 loops=1)
   Buffers: shared hit=2 read=3
   InitPlan 1 (returns $0)
     ->  Limit  (cost=0.56..1.37 rows=1 width=11) (actual
time=14.907..14.908 rows=1 loops=1)
           Buffers: shared hit=2 read=3
           ->  Index Only Scan Backward using etl00001_vbak_pkey on
etl00001_vbak  (cost=0.56..11498362.31 rows=14214672 width=11) (actual
time=14.906..14.906 rows=1 loops=1)
                 Index Cond: (("VBAK_MANDT" = '300'::bpchar) AND
("VBAK_VBELN" IS NOT NULL))
                 Heap Fetches: 1
                 Buffers: shared hit=2 read=3
 Planning time: 0.248 ms
 Execution time: 14.945 ms
(11 rows)

That is more in line with my expectations.

Oddly enough, adding "MANDT_VBAK" and the group by back into that last
query, the result is a seqscan again.

For "VBAK_MANDT" we see these statistics:
Null fraction: 0
Average width: 4
Distinct values: 1
Most common values: {300}
Most common frequencies: {1}
Histogram bounds :
Correlation :1

The table definition is:
          Table "staging.etl00001_vbak"
       Column        |     Type      | Modifiers
---------------------+---------------+-----------
 VBAK_MANDT          | character(3)  | not null
 VBAK_VBELN          | character(10) | not null
 VBAK_ERDAT          | date          | not null
 VBAK_ERZET          | character(6)  | not null
 VBAK_ERNAM          | character(12) | not null
 VBAK_ANGDT          | date          | not null
 VBAK_BNDDT          | date          | not null
 VBAK_AUDAT          | date          | not null
...
VBAK_MULTI          | character(1)  | not null
 VBAK_SPPAYM         | character(2)  | not null
Indexes:
    "etl00001_vbak_pkey" PRIMARY KEY, btree ("VBAK_MANDT", "VBAK_VBELN")
    "idx_etl00001_vbak_erdat" btree ("VBAK_ERDAT")

A final remark: The table definition was generated by our
reporting/ETL software, hence the case-sensitive column names and the
use of the character type instead of varchar (or text).

What can be done to help the planner choose a smarter plan?
--
If you can't see the forest for the trees,
Cut the trees and you'll see there is no forest.


В списке pgsql-general по дате отправления:

Предыдущее
От: Christoph Moench-Tegeder
Дата:
Сообщение: Re: [GENERAL] Logging the fact that a log was shipped
Следующее
От: Alban Hertroys
Дата:
Сообщение: Re: [GENERAL] Performance with high correlation in group by on PK