Обсуждение: help understanding create statistic

Поиск
Список
Период
Сортировка

help understanding create statistic

От
Luca Ferrari
Дата:
Hi all,
in order to better understand this feature of 10, I've created a table
like this:

CREATE TABLE expenses(
   ...
   day date,
   year int,
   CHECK( year = EXTRACT( year FROM day ) )
);

so that I can ensure 'year' and 'day' are tied together:

SELECT
count(*) FILTER( WHERE year = 2016 ) AS by_year,
count(*) FILTER( WHERE EXTRACT( year FROM day ) = 2016 ) AS by_day
FROM expenses;
-[ RECORD 1 ]-
by_year | 8784
by_day  | 8784

Then I created a statistic:

CREATE STATISTICS stat_day_year ( dependencies )
ON day, year
FROM expenses;

select * from pg_statistic_ext ;
-[ RECORD 1 ]---+---------------------
stxrelid        | 42833
stxname         | stat_day_year
stxnamespace    | 2200
stxowner        | 16384
stxkeys         | 3 5
stxkind         | {f}
stxndistinct    |
stxdependencies | {"3 => 5": 1.000000}

Now, having an index on the extract year of day as follows:

CREATE INDEX idx_year
ON expenses ( EXTRACT( year FROM day ) );

why is the planner not choosing to use such index on a 'year' raw query?

EXPLAIN SELECT * FROM expenses
WHERE year = 2016;
                                  QUERY PLAN
-------------------------------------------------------------------------------
 Gather  (cost=1000.00..92782.34 rows=8465 width=32)
   Workers Planned: 2
   ->  Parallel Seq Scan on expenses  (cost=0.00..90935.84 rows=3527 width=32)
         Filter: (year = 2016)

The number of rows are correct, but I was expecting it to use the same
index as a query like "WHERE EXTRACT( year FROM day) = 2016" triggers.

Even altering the year column to not null does show any change, and
this is the plan obtained turning off seq_scan (to see the costs):

EXPLAIN ANALYZE SELECT * FROM expenses
WHERE year = 2016;
                                                            QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on expenses  (cost=10000000000.00..10000127402.44 rows=8451
width=32) (actual time=972.734..2189.300 rows=8784 loops=1)
   Filter: (year = 2016)
   Rows Removed by Filter: 4991216


Am I misunderstaing this functional dependency?

Thanks,
Luca


Re: help understanding create statistic

От
David Rowley
Дата:
On 28 June 2018 at 21:38, Luca Ferrari <fluca1978@gmail.com> wrote:
> CREATE INDEX idx_year
> ON expenses ( EXTRACT( year FROM day ) );
>
> why is the planner not choosing to use such index on a 'year' raw query?
>
> EXPLAIN SELECT * FROM expenses
> WHERE year = 2016;

The expression in the where clause must match the indexed expression.
You'd need to add an index on just (year) for that to work.

> Am I misunderstaing this functional dependency?

Yeah, the statistics are just there to drive the planner's costing.
They won't serve as proof for anything else.

All you've done by creating those stats is to allow better estimates
for queries such as:

SELECT * FROM expenses WHERE day = '2018-06-28' and year = 2018;

> stxdependencies | {"3 => 5": 1.000000}

It would appear that "3" is the attnum for day and "5" is year. All
that tells the planner is that on the records sampled during analyze
is that each "day" had about exactly 1 year.

There's nothing then to stop you going and adding a record with the
day '2017-01-01' and the year 2018. The stats will remain the same
until you analyze the table again.

If those stats didn't exist, the planner would have multiplied the
selectivity estimates of each item in the WHERE clause individually.
So if about 10% of records had year=2018, and 0.01% had '2018-06-28',
then the selectivity would have been 0.1 *  0.001.   With a functional
dependency of 1, the selectivity just becomes 0.001.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services