Обсуждение: Bad Planner Statistics for Uneven distribution.

Поиск
Список
Период
Сортировка

Bad Planner Statistics for Uneven distribution.

От
"Kevin McArthur"
Дата:
I discussed this with a few members of #postgresql freenode this morning. I'll keep it breif; [note: i have cleaned out columns not relevant]
 
I have two tables, brands and models_brands. The first has about 300 records, the later about 350,000 records. The number of distinct brands in the models_brands table is < 10.
 
 
 
=# \d models_brands
        Table "public.models_brands"
 Column |         Type          | Modifiers
--------+-----------------------+-----------
 model  | integer               | not null
 brand  | integer               | not null
Indexes:
    "models_brands_brand" btree (brand)
Foreign-key constraints:
    "models_brands_brand_fkey" FOREIGN KEY (brand) REFERENCES brands(brand_id) ON UPDATE CASCADE ON DELETE CASCADE
    "models_brands_model_fkey" FOREIGN KEY (model) REFERENCES models(model_id) ON UPDATE CASCADE ON DELETE CASCADE
 
a=# \d brands;
                                      Table "public.brands"
   Column   |          Type          |                         Modifiers
------------+------------------------+-----------------------------------------------------------
 brand_id   | integer                | not null default nextval('brands_brand_id_seq'::regclass)
 brand_name | character varying(255) | not null
Indexes:
    "brands_pkey" PRIMARY KEY, btree (brand_id)
 
Now the plans/problems..
 
=# set enable_seqscan to on;
SET
=# explain analyze select distinct brand from models_brands;
                                                            QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=46300.70..48148.15 rows=4 width=4) (actual time=3699.691..6215.216 rows=4 loops=1)
   ->  Sort  (cost=46300.70..47224.43 rows=369489 width=4) (actual time=3699.681..5027.069 rows=369489 loops=1)
         Sort Key: brand
         ->  Seq Scan on models_brands  (cost=0.00..6411.89 rows=369489 width=4) (actual time=0.040..1352.997 rows=369489 loops=1)
 Total runtime: 6223.666 ms
(5 rows)
 
=# set enable_seqscan to off;
SET
=# explain analyze select distinct brand from models_brands;
                                                                        QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=0.00..863160.68 rows=4 width=4) (actual time=0.131..2584.779 rows=4 loops=1)
   ->  Index Scan using models_brands_brand on models_brands  (cost=0.00..862236.96 rows=369489 width=4) (actual time=0.122..1440.809 rows=369489 loops=1)
 Total runtime: 2584.871 ms
(3 rows)
 
 
Picks the wrong plan here. Should pick the index with seqscanning enabled.
 
 
More (as a different wording/query)... (as suggested by others on irc)
 
 
=# set enable_seqscan to on;
SET
=# explain analyze select brand_id from brands where exists (select 1 from models_brands where brand = brands.brand_id);
                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Seq Scan on brands  (cost=0.00..30.09 rows=152 width=4) (actual time=7742.460..62567.543 rows=4 loops=1)
   Filter: (subplan)
   SubPlan
     ->  Seq Scan on models_brands  (cost=0.00..7335.61 rows=92372 width=0) (actual time=206.467..206.467 rows=0 loops=303)
           Filter: (brand = $0)
 Total runtime: 62567.626 ms
 
a=# set enable_seqscan to off;
SET
=# explain analyze select brand_id from brands where exists (select 1 from models_brands where brand = brands.brand_id);
                                                                      QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on brands  (cost=100000000.00..100000715.90 rows=152 width=4) (actual time=0.615..3.710 rows=4 loops=1)
   Filter: (subplan)
   SubPlan
     ->  Index Scan using models_brands_brand on models_brands  (cost=0.00..216410.97 rows=92372 width=0) (actual time=0.008..0.008 rows=0 loops=303)
           Index Cond: (brand = $0)
 Total runtime: 3.790 ms
 
 
It was also tried to similar results with a LIMIT 1 in the subquery for exist.
 
More...
 
Seqscan still off..
 
 
=# explain analyze select distinct brand_id from brands inner join models_brands on (brand_id = brand);
                                                                           QUERY PLAN                                                                       
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=0.00..867782.58 rows=303 width=4) (actual time=0.391..4898.579 rows=4 loops=1)
   ->  Merge Join  (cost=0.00..866858.85 rows=369489 width=4) (actual time=0.383..3749.771 rows=369489 loops=1)
         Merge Cond: ("outer".brand_id = "inner".brand)
         ->  Index Scan using brands_pkey on brands  (cost=0.00..15.53 rows=303 width=4) (actual time=0.080..0.299 rows=60 loops=1)
         ->  Index Scan using models_brands_brand on models_brands  (cost=0.00..862236.96 rows=369489 width=4) (actual time=0.013..1403.175 rows=369489 loops=1)
 Total runtime: 4898.697 ms
 
=# set enable_seqscan to on;
SET
=# explain analyze select distinct brand_id from brands inner join models_brands on (brand_id = brand);
                                                               QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=46300.70..52770.04 rows=303 width=4) (actual time=3742.046..8560.833 rows=4 loops=1)
   ->  Merge Join  (cost=46300.70..51846.32 rows=369489 width=4) (actual time=3742.035..7406.677 rows=369489 loops=1)
         Merge Cond: ("outer".brand_id = "inner".brand)
         ->  Index Scan using brands_pkey on brands  (cost=0.00..15.53 rows=303 width=4) (actual time=0.077..0.407 rows=60 loops=1)
         ->  Sort  (cost=46300.70..47224.43 rows=369489 width=4) (actual time=3741.584..5051.348 rows=369489 loops=1)
               Sort Key: models_brands.brand
               ->  Seq Scan on models_brands  (cost=0.00..6411.89 rows=369489 width=4) (actual time=0.027..1346.178 rows=369489 loops=1)
 Total runtime: 8589.502 ms
(8 rows)
 
 
Hope that helps
 
Kevin McArthur
 
 
 
 
 

 
 

Re: Bad Planner Statistics for Uneven distribution.

От
Tom Lane
Дата:
"Kevin McArthur" <Kevin@StormTide.ca> writes:
>          ->  Seq Scan on models_brands  (cost=0.00..6411.89 rows=369489 width=4) (actual time=0.040..1352.997
rows=369489loops=1) 
> ...
>    ->  Index Scan using models_brands_brand on models_brands  (cost=0.00..862236.96 rows=369489 width=4) (actual
time=0.122..1440.809rows=369489 loops=1) 

> Picks the wrong plan here. Should pick the index with seqscanning enabled.

It's really not possible for a full-table indexscan to be faster than a
seqscan, and not very credible for it even to be approximately as fast.
I suspect your second query here is the beneficiary of the first query
having fetched all the pages into cache.  In general, if you want to
optimize for a mostly-cached database, you need to reduce
random_page_cost below its default value ...

            regards, tom lane

Re: Bad Planner Statistics for Uneven distribution.

От
"Guillaume Smet"
Дата:
Tom,

On 7/21/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> It's really not possible for a full-table indexscan to be faster than a
> seqscan, and not very credible for it even to be approximately as fast.
> I suspect your second query here is the beneficiary of the first query
> having fetched all the pages into cache.  In general, if you want to
> optimize for a mostly-cached database, you need to reduce
> random_page_cost below its default value ...

We discussed this case on IRC and the problem was not the first set of
queries but the second one:
select brand_id from brands where exists (select 1 from models_brands
where brand = brands.brand_id);).

Isn't there any way to make PostgreSQL have a better estimation here:
->  Index Scan using models_brands_brand on models_brands
(cost=0.00..216410.97 rows=92372 width=0) (actual time=0.008..0.008
rows=0 loops=303)
           Index Cond: (brand = $0)

I suppose it's because the planner estimates that there will be 92372
result rows that it chooses the seqscan instead of the index scan.
ALTER STATISTICS didn't change anything.
IIRC, there were already a few threads about the same sort of
estimation problem and there wasn't any solution to solve this
problem. Do you have any hint/ideas?

--
Guillaume

Re: Bad Planner Statistics for Uneven distribution.

От
Tom Lane
Дата:
"Guillaume Smet" <guillaume.smet@gmail.com> writes:
> Isn't there any way to make PostgreSQL have a better estimation here:
> ->  Index Scan using models_brands_brand on models_brands
> (cost=0.00..216410.97 rows=92372 width=0) (actual time=0.008..0.008
> rows=0 loops=303)
>            Index Cond: (brand = $0)

Note that the above plan extract is pretty misleading, because it
doesn't account for the implicit "LIMIT 1" of an EXISTS() clause.
What the planner is *actually* imputing to this plan is 216410.97/92372
cost units, or about 2.34.  However that applies to the seqscan variant
as well.

I think the real issue with Kevin's example is that when doing an
EXISTS() on a brand_id that doesn't actually exist in the table, the
seqscan plan has worst-case behavior (ie, scan the whole table) while
the indexscan plan still manages to be cheap.  Because his brands table
has so many brand_ids that aren't in the table, that case dominates the
results.  Not sure how we could factor that risk into the cost
estimates.  The EXISTS code could probably special-case it reasonably
well for the simplest seqscan and indexscan subplans, but I don't see
what to do with more general subqueries (like joins).

            regards, tom lane