Discussion: strange estimate for number of rows


strange estimate for number of rows

From: Andrew Sullivan
Date:
Hi all,

I've one here that I cannot fathom.  Any suggestions?

We have a table, transactions_posted, where we're selecting by a range
of dates and an identifier.  (Some of this is redacted, obviously):

\d transactions_posted

       Column       |           Type           |     Modifiers
--------------------+--------------------------+--------------------
 id                 | integer                  | not null
 transaction_date   | timestamp with time zone | not null
 product_id         | integer                  | not null
Indexes:
    "trans_posted_trans_date_idx" btree (transaction_date, product_id)


The statistics targets on transaction_date and product_id are set to
1000, and everything has been analysed recently.  But I'm getting a poor
plan, because the estimated number of rows to be returned is about
double the actual count:

explain analyse select * from transactions_posted where
transaction_date >= '2003-9-1' and transaction_date < '2003-10-1' and
product_id = 2;

QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on transactions_posted  (cost=0.00..376630.33 rows=700923 width=91) (actual time=8422.253..36176.078 rows=316029 loops=1)
   Filter: ((transaction_date >= '2003-09-01 00:00:00-04'::timestamp with time zone) AND (transaction_date < '2003-10-01 00:00:00-04'::timestamp with time zone) AND (product_id = 2))
 Total runtime: 36357.630 ms
(3 rows)
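
In case it matters, the statistics targets were raised the usual way;
the exact commands aren't preserved here, but it was something along
the lines of:

ALTER TABLE transactions_posted ALTER COLUMN transaction_date SET STATISTICS 1000;
ALTER TABLE transactions_posted ALTER COLUMN product_id SET STATISTICS 1000;
ANALYZE transactions_posted;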

SET enable_seqscan = off;

explain analyse select * from transactions_posted where
transaction_date >= '2003-9-1' and transaction_date < '2003-10-1' and
product_id = 2;

QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using trans_posted_trans_date_idx on transactions_posted  (cost=0.00..1088862.56 rows=700923 width=91) (actual time=35.214..14816.257 rows=316029 loops=1)
   Index Cond: ((transaction_date >= '2003-09-01 00:00:00-04'::timestamp with time zone) AND (transaction_date < '2003-10-01 00:00:00-04'::timestamp with time zone) AND (product_id = 2))
 Total runtime: 15009.816 ms
(3 rows)

SELECT attname, null_frac, avg_width, n_distinct, correlation
FROM pg_stats
WHERE tablename = 'transactions_posted'
  AND attname IN ('transaction_date', 'product_id');
     attname      | null_frac | avg_width | n_distinct | correlation
------------------+-----------+-----------+------------+-------------
 product_id       |         0 |         4 |          2 |    0.200956
 transaction_date |         0 |         8 |  -0.200791 |    0.289248

Any ideas?  I'm loath to recommend CLUSTER, since the data will not
stay clustered.

A


--
----
Andrew Sullivan                         204-4141 Yonge Street
Afilias Canada                        Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110


Re: strange estimate for number of rows

From: Tom Lane
Date:
Andrew Sullivan <andrew@libertyrms.info> writes:
> The statistics targets on transaction_date and product_id are set to
> 1000, and everything has been analysed recently.  But I'm getting a poor
> plan, because the estimated number of rows to be returned is about
> double the actual count:

> explain analyse select * from transactions_posted where
> transaction_date >= '2003-9-1' and transaction_date < '2003-10-1' and
> product_id = 2;

Are the estimates accurate for queries on the two columns individually,
i.e.
... where transaction_date >= '2003-9-1' and transaction_date < '2003-10-1'
... where product_id = 2
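
In full, against the table above, that would be:

explain analyse select * from transactions_posted where
transaction_date >= '2003-9-1' and transaction_date < '2003-10-1';

explain analyse select * from transactions_posted where product_id = 2;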

If so, the problem is that there's a correlation between
transaction_date and product_id, which the system cannot model because
it has no multi-column statistics.

However, given that the estimate is only off by about a factor of 2
(700923 estimated vs. 316029 actual rows), you'd still be getting the
wrong plan even if the estimate were perfect, because the estimated
costs differ by nearly a factor of 3 (1088862.56 vs. 376630.33).

Given the actual runtimes, I'm thinking maybe you want to reduce
random_page_cost.  What are you using for that now?
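
To check the current setting:

SHOW random_page_cost;

The built-in default is 4.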

            regards, tom lane

Re: strange estimate for number of rows

From: Daniel Manley
Date:
Hi, I'm the lead developer on the project this concerns (forgive my
newbiness on this list).

We tried a couple of scenarios with effective_cache_size=60000,
cpu_index_tuple_cost=0.0001 and random_page_cost=2 with no change in the
plan.
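
Roughly, per session, that was something like (note effective_cache_size
is measured in 8 kB pages, so 60000 is about 470 MB):

SET effective_cache_size = 60000;
SET cpu_index_tuple_cost = 0.0001;
SET random_page_cost = 2;

followed by re-running the queries.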

explain analyse select * from transactions_posted where transaction_date >=
'2003-9-1' and transaction_date < '2003-10-1';

-----------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on transactions_posted  (cost=0.00..348199.14 rows=1180724 width=91) (actual time=7727.668..36286.898 rows=579238 loops=1)
   Filter: ((transaction_date >= '2003-09-01 00:00:00+00'::timestamp with time zone) AND (transaction_date < '2003-10-01 00:00:00+00'::timestamp with time zone))
 Total runtime: 36625.351 ms

explain analyse select * from transactions_posted where product_id = 2;

-----------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on transactions_posted  (cost=0.00..319767.95 rows=6785237 width=91) (actual time=0.091..35596.328 rows=5713877 loops=1)
   Filter: (product_id = 2)
 Total runtime: 38685.373 ms

The product_id alone gives a difference of about a million rows from
estimate to actual, vs. the factor of 2 from the transaction_date.
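
As a sanity check, the combined estimate looks like a straight product
of the two individual selectivities, which is what the planner does
absent cross-column statistics.  These numbers imply a table of roughly
11.4 million rows; on that assumption,

    (1180724 / 11430000) * (6785237 / 11430000) * 11430000 =~ 700900

which is essentially the 700923 the planner produced for the combined
query.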

Dan Manley


Re: strange estimate for number of rows

From: Tom Lane
Date:
Daniel Manley <dmanley@libertyrms.info> writes:
> The product_id alone gives a difference of about a million rows from
> estimate to actual, vs. the factor of 2 from the transaction_date.

You should be thinking in terms of ratios, not absolute differences.
The rows estimate for product_id doesn't look too bad to me (6785237
estimated vs. 5713877 actual, under 20% off); the one for
transaction_date is much further off (1180724 vs. 579238, a factor of
2).  Which is odd, because the system can usually do all right on range
estimates if you've let it run an ANALYZE with enough histogram bins.
Could we see the pg_stats row for transaction_date?
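
Something like this would show the whole thing, histogram bounds
included:

SELECT * FROM pg_stats
WHERE tablename = 'transactions_posted'
  AND attname = 'transaction_date';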

> We tried a couple of scenarios with effective_cache_size=60000,
> cpu_index_tuple_cost=0.0001 and random_page_cost=2 with no change in the
> plan.

Since you need about a factor of 3 change in the cost estimate to get it to
switch plans, changing random_page_cost by a factor of 2 ain't gonna do
it (the other two numbers are second-order adjustments unlikely to have
much effect, I think).  Try 1.5, or even less ... of course, you have to
keep an eye on your other queries and make sure they don't go nuts, but
from what I've heard about your hardware setup a low random_page_cost
isn't out of line with the physical realities.
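
For instance:

SET random_page_cost = 1.5;

then re-run the EXPLAIN ANALYSE and see whether the index scan starts
to win on estimated cost.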

            regards, tom lane

Re: strange estimate for number of rows

From: Andrew Sullivan
Date:
On Thu, Nov 13, 2003 at 03:19:08PM -0500, Tom Lane wrote:
> because the system can usually do all right on range estimates if you've
> let it run an ANALYZE with enough histogram bins.  Could we see the
> pg_stats row for transaction_date?

Do you want the whole thing?  I left out the really verbose bits when
I posted this in the original:

SELECT attname, null_frac, avg_width, n_distinct, correlation
FROM pg_stats
WHERE tablename = 'transactions_posted'
  AND attname IN ('transaction_date', 'product_id');
     attname      | null_frac | avg_width | n_distinct | correlation
------------------+-----------+-----------+------------+-------------
 product_id       |         0 |         4 |          2 |    0.200956
 transaction_date |         0 |         8 |  -0.200791 |    0.289248

>
> Since you need about a factor of 3 change in the cost estimate to get it to
> switch plans, changing random_page_cost by a factor of 2 ain't gonna do
> it (the other two numbers are second-order adjustments unlikely to have
> much effect, I think).  Try 1.5, or even less ... of course, you have to
> keep an eye on your other queries and make sure they don't go nuts, but
> from what I've heard about your hardware setup a low random_page_cost
> isn't out of line with the physical realities.

Actually, this one's on an internal box, and I think 1.5 is too low
-- it's really just a pair of mirrored SCSI disks on a PCI controller
in this case.  That does the trick, though, so maybe I'm just being
too conservative.

A

--
----
Andrew Sullivan                         204-4141 Yonge Street
Afilias Canada                        Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110


Re: strange estimate for number of rows

From: Andrew Sullivan
Date:
On Thu, Nov 13, 2003 at 04:37:03PM -0500, Andrew Sullivan wrote:
> Actually, this one's on an internal box, and I think 1.5 is too low
> -- it's really just a pair of mirrored SCSI disks on a PCI controller
> in this case.  That does the trick, though, so maybe I'm just being
> too conservative.

I spoke too soon.  I'd left enable_seqscan=off set.  It doesn't
actually prefer an index scan until I set random_page_cost to 0.5.
I think that's a little unrealistic ;-)
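
For the record, the fair test is with the override removed:

SET enable_seqscan = on;    -- undo the earlier experiment
SET random_page_cost = 1.5;
EXPLAIN ANALYSE SELECT * FROM transactions_posted
 WHERE transaction_date >= '2003-9-1'
   AND transaction_date < '2003-10-1'
   AND product_id = 2;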

A

--
----
Andrew Sullivan                         204-4141 Yonge Street
Afilias Canada                        Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110