Обсуждение: correlation in pg_stats

Поиск
Список
Период
Сортировка

correlation in pg_stats

От
Ron Mayer
Дата:
Short summary:
 * It looks to me like the planner vastly overestimates    the # of pages read by index scan in quite a few of my
tableseven though stats collected by ANALYZE are correct.
 
 * The problem happens any time you have multiple columns   that have a number of repeated values in them, and   you
CLUSTERthe table by a sort using both columns   (like "city,state,zip,phone#" or "firstname,lastname").
 
 * I think this is the problem that Mark Kirkwood is seeing   in his threads Query optimizer 8.0.1 and "One Big trend
vsmultiple smaller trends" in hackers.
 
 * A test script demonstrating the issue also follows.
 * I think keeping one more stat per attribute in   pg_stastic that could describe this behavior.


Longer:  

 If I understand the optimizer correctly,  correlation is used to both guess how much random disk access will be
requiredin a query; as well as estimate how many pages will be read.
 
 Unfortunately, many tables in my larger databases have columns with values that are tightly packed on a few pages;
eventhough there is no total-ordering across the whole table. Stephan Szabo described this as a "clumping effect":
http://archives.postgresql.org/pgsql-performance/2003-01/msg00286.php

 The test script below shows a table with 6 columns and  1 million rows.
 Note that ANALYZE correctly observes that the correlation for all the columns except "A" is near zero.
 However, also note that columns A,B,C,and even D will have extremely few UNIQUE values on any given page of data.  And
conversely,an index scan on any of those columns be a good choice (as seen by EXPLAIN ANALYZE output below).
 

 Instead of just storing "correlation",  I would also like to store a variation of "correlation" that "pretends" that
thedisk-pages for a particular column were sorted by the min value for that column.  Wherever the optimizer would use
theexisting 'correlation' to estimate the number of pages that would be accesses; it could use this  "correlation
discountingthe order of blocks" value instead. All the math/stastics/theory that suggests correlation is good at
estimating# of pages would remain intact. Anyway... I've talked too much...
 

The test script showing   1) the tables  2) the values in pg_stats  3) EXPLAIN ANALYZE of columns showing the problem
follows:


********************************************************************************
********************************************************************************
********************************************************************************
********************************************************************************
********************************************************************************

fli=# create temporary table tmp1mil as     select * from            (select generate_series as a from
generate_series(0,9))as a,            (select generate_series as b from generate_series(0,9)) as b,            (select
generate_seriesas c from generate_series(0,9)) as c,            (select generate_series as d from generate_series(0,9))
asd,            (select generate_series as e from generate_series(0,9)) as e,            (select generate_series as f
fromgenerate_series(0,9)) as f     order by a,b,c,d,e,f;
 
fli-# fli-# fli-# fli-# fli-# fli-# fli-# fli-# 
SELECT
fli=# fli=# vacuum analyze tmp1mil;
VACUUM
fli=# select * from pg_stats where tablename='tmp1mil';schemaname | tablename | attname | null_frac | avg_width |
n_distinct|   most_common_vals    |                                       most_common_freqs
          | histogram_bounds | correlation 
 

------------+-----------+---------+-----------+-----------+------------+-----------------------+------------------------------------------------------------------------------------------------+------------------+-------------pg_temp_4
| tmp1mil   | a       |         0 |         4 |         10 | {7,8,5,0,6,3,4,1,2,9} |
{0.113,0.106667,0.105667,0.104333,0.101333,0.097,0.095,0.0936667,0.092,0.0913333}             |                  |
    1pg_temp_4  | tmp1mil   | b       |         0 |         4 |         10 | {1,4,0,5,6,2,9,7,3,8} |
{0.119333,0.112667,0.107,0.101,0.099,0.0976667,0.0946667,0.092,0.0886667,0.088}               |                  |
0.229754pg_temp_4 | tmp1mil   | c       |         0 |         4 |         10 | {9,5,0,1,8,6,4,7,3,2} |
{0.114667,0.107,0.103,0.101667,0.101667,0.0996667,0.0956667,0.094,0.0933333,0.0893333}        |                  |
0.142119pg_temp_4 | tmp1mil   | d       |         0 |         4 |         10 | {4,3,2,8,7,9,5,6,1,0} |
{0.114667,0.11,0.108,0.104,0.102667,0.099,0.098,0.0903333,0.0893333,0.084}                    |                  |
0.0930835pg_temp_4 | tmp1mil   | e       |         0 |         4 |         10 | {0,5,1,9,4,7,8,2,3,6} |
{0.109333,0.109333,0.108333,0.100667,0.100333,0.0983333,0.0966667,0.0936667,0.092,0.0913333}  |                  |
0.138575pg_temp_4 | tmp1mil   | f       |         0 |         4 |         10 | {2,8,6,0,3,5,9,1,7,4} |
{0.104667,0.103667,0.102667,0.102333,0.101667,0.100667,0.100667,0.0986667,0.0943333,0.0906667}|                  |
0.139991
(6 rows)

fli=# create index tmp1mil__c on tmp1mil(c);
CREATE INDEX
fli=# explain analyze select * from tmp1mil where c=5;                                                   QUERY PLAN
                                              
 
------------------------------------------------------------------------------------------------------------------Seq
Scanon tmp1mil  (cost=0.00..19853.00 rows=107001 width=24) (actual time=2.164..480.863 rows=100000 loops=1)  Filter: (c
=5)Total runtime: 531.345 ms
 
(3 rows)

fli=# explain analyze select * from tmp1mil where c=5;                                                   QUERY PLAN
                                              
 
------------------------------------------------------------------------------------------------------------------Seq
Scanon tmp1mil  (cost=0.00..19853.00 rows=107001 width=24) (actual time=2.191..481.849 rows=100000 loops=1)  Filter: (c
=5)Total runtime: 531.858 ms
 
(3 rows)


fli=# set enable_seqscan=false;
SET
fli=# explain analyze select * from tmp1mil where c=5;
QUERYPLAN                                                              
 

--------------------------------------------------------------------------------------------------------------------------------------Index
Scanusing tmp1mil__c on tmp1mil  (cost=0.00..364309.19 rows=107001 width=24) (actual time=0.102..101.460 rows=100000
loops=1) Index Cond: (c = 5)Total runtime: 152.103 ms
 
(3 rows)

fli=# explain analyze select * from tmp1mil where c=5;
QUERYPLAN                                                              
 

--------------------------------------------------------------------------------------------------------------------------------------Index
Scanusing tmp1mil__c on tmp1mil  (cost=0.00..364309.19 rows=107001 width=24) (actual time=0.082..101.298 rows=100000
loops=1) Index Cond: (c = 5)Total runtime: 151.914 ms
 
(3 rows)

fli=# 


Re: correlation in pg_stats

От
pgsql@mohawksoft.com
Дата:
>
> Short summary:
>
>   * It looks to me like the planner vastly overestimates
>     the # of pages read by index scan in quite a few of my
>     tables even though stats collected by ANALYZE are correct.
>
>   * The problem happens any time you have multiple columns
>     that have a number of repeated values in them, and
>     you CLUSTER the table by a sort using both columns
>     (like "city,state,zip,phone#" or "firstname,lastname").
>
>   * I think this is the problem that Mark Kirkwood is seeing
>     in his threads Query optimizer 8.0.1 and "One Big trend
>     vs multiple smaller trends" in hackers.

actually pgsql@mohawksoft.com, is Mark Woodward. Pleased to meet you.
:)

(I hate using my name on lists like this because of spammers)

>
>   * A test script demonstrating the issue also follows.
>
>   * I think keeping one more stat per attribute in
>     pg_stastic that could describe this behavior.
>
>
> Longer:
>
>
>   If I understand the optimizer correctly,  correlation is used
>   to both guess how much random disk access will be required in
>   a query; as well as estimate how many pages will be read.
>
>   Unfortunately, many tables in my larger databases have
>   columns with values that are tightly packed on a few pages;
>   even though there is no total-ordering across the whole table.
>   Stephan Szabo described this as a "clumping effect":
>   http://archives.postgresql.org/pgsql-performance/2003-01/msg00286.php

Yes.

I think we are describing the exact same issue.



Re: correlation in pg_stats

От
Mark Kirkwood
Дата:
pgsql@mohawksoft.com wrote:

> actually pgsql@mohawksoft.com, is Mark Woodward. Pleased to meet you.
> :)
> 
> (I hate using my name on lists like this because of spammers)

Not to be confused with me :-)


Re: correlation in pg_stats

От
"Simon Riggs"
Дата:
>M@a@r@k wrote
> > Short summary:
> >
> >   * It looks to me like the planner vastly overestimates
> >     the # of pages read by index scan in quite a few of my
> >     tables even though stats collected by ANALYZE are correct.
> >
> >   * The problem happens any time you have multiple columns
> >     that have a number of repeated values in them, and
> >     you CLUSTER the table by a sort using both columns
> >     (like "city,state,zip,phone#" or "firstname,lastname").
> >
> >   * I think this is the problem that Mark Kirkwood is seeing
> >     in his threads Query optimizer 8.0.1 and "One Big trend
> >     vs multiple smaller trends" in hackers.
>
> actually pgsql@mohawksoft.com, is Mark Woodward. Pleased to meet you.
> :)
>
> (I hate using my name on lists like this because of spammers)
>
> >
> >   * A test script demonstrating the issue also follows.
> >
> >   * I think keeping one more stat per attribute in
> >     pg_stastic that could describe this behavior.
> >
> >
> > Longer:
> >
> >
> >   If I understand the optimizer correctly,  correlation is used
> >   to both guess how much random disk access will be required in
> >   a query; as well as estimate how many pages will be read.
> >
> >   Unfortunately, many tables in my larger databases have
> >   columns with values that are tightly packed on a few pages;
> >   even though there is no total-ordering across the whole table.
> >   Stephan Szabo described this as a "clumping effect":
> >
> http://archives.postgresql.org/pgsql-performance/2003-01/msg00286.php
>
> Yes.
>
> I think we are describing the exact same issue.
>

I believe that is a very common case.

Many natural value distributions follow the Zipf distribution, which is
commonly known as the 80-20 curve.

Best Regards, SImon Riggs