Discussion: strange estimate for number of rows
Hi all,

I've got one here that I cannot fathom. Any suggestions? We have a
table, call it tablename, where we're selecting by a range of dates
and an identifier. (This is redacted, obviously):

\d tablename
      Column       |           Type           |     Modifiers
-------------------+--------------------------+--------------------
 id                | integer                  | not null
 transaction_date  | timestamp with time zone | not null
 product_id        | integer                  | not null
Indexes:
    "trans_posted_trans_date_idx" btree (transaction_date, product_id)

The statistics on transaction_date and product_id are set to 1000.
Everything is all analysed nicely. But I'm getting a poor plan,
because of an estimate that the number of rows to be returned is about
double how many actually are:

explain analyse select * from transactions_posted where
transaction_date >= '2003-9-1' and transaction_date < '2003-10-1' and
product_id = 2;

                              QUERY PLAN
------------------------------------------------------------------------
 Seq Scan on transactions_posted  (cost=0.00..376630.33 rows=700923
   width=91) (actual time=8422.253..36176.078 rows=316029 loops=1)
   Filter: ((transaction_date >= '2003-09-01 00:00:00-04'::timestamp
     with time zone) AND (transaction_date < '2003-10-01
     00:00:00-04'::timestamp with time zone) AND (product_id = 2))
 Total runtime: 36357.630 ms
(3 rows)

SET enable_seqscan = off;

explain analyse select * from transactions_posted where
transaction_date >= '2003-9-1' and transaction_date < '2003-10-1' and
product_id = 2;

                              QUERY PLAN
------------------------------------------------------------------------
 Index Scan using trans_posted_trans_date_idx on transactions_posted
   (cost=0.00..1088862.56 rows=700923 width=91)
   (actual time=35.214..14816.257 rows=316029 loops=1)
   Index Cond: ((transaction_date >= '2003-09-01
     00:00:00-04'::timestamp with time zone) AND (transaction_date <
     '2003-10-01 00:00:00-04'::timestamp with time zone) AND
     (product_id = 2))
 Total runtime: 15009.816 ms
(3 rows)

SELECT attname,null_frac,avg_width,n_distinct,correlation FROM pg_stats
where tablename = 'transactions_posted' AND attname in
('transaction_date','product_id');

     attname      | null_frac | avg_width | n_distinct | correlation
------------------+-----------+-----------+------------+-------------
 product_id       |         0 |         4 |          2 |    0.200956
 transaction_date |         0 |         8 |  -0.200791 |    0.289248

Any ideas? I'm loath to recommend CLUSTER, since the data will not
stay clustered.

A

--
----
Andrew Sullivan                         204-4141 Yonge Street
Afilias Canada                        Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110
Andrew Sullivan <andrew@libertyrms.info> writes:
> The statistics on transaction_date and product_id are set to 1000.
> Everything is all analysed nicely. But I'm getting a poor plan,
> because of an estimate that the number of rows to be returned is
> about double how many actually are:

> explain analyse select * from transactions_posted where
> transaction_date >= '2003-9-1' and transaction_date < '2003-10-1' and
> product_id = 2;

Are the estimates accurate for queries on the two columns individually,
ie

... where transaction_date >= '2003-9-1' and transaction_date < '2003-10-1'
... where product_id = 2

If so, the problem is that there's a correlation between
transaction_date and product_id, which the system cannot model because
it has no multi-column statistics.

However, given that the estimate is only off by about a factor of 2,
you'd still be getting the wrong plan even if the estimate were
perfect, because the estimated costs differ by nearly a factor of 3.

Given the actual runtimes, I'm thinking maybe you want to reduce
random_page_cost. What are you using for that now?

			regards, tom lane
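[Editor's note: without multi-column statistics, the planner treats the
two predicates as independent and multiplies their selectivities. A
back-of-the-envelope sketch, using the single-column estimates posted
later in the thread, reproduces the rows=700923 figure; the total table
size N is back-derived from those numbers, so treat it as an assumption.]

```python
# Sketch: how the planner combines single-column selectivities under
# the independence assumption (no multi-column statistics).
# N is back-derived from the EXPLAIN output in this thread -- an assumption.
N = 11_430_000                 # approximate total rows in the table

rows_product_id = 6_785_237    # planner estimate for product_id = 2 alone
rows_date_range = 1_180_724    # planner estimate for the date range alone

sel_product = rows_product_id / N
sel_date = rows_date_range / N

# Independence assumption: combined selectivity is the product.
combined_estimate = sel_product * sel_date * N
print(round(combined_estimate))        # close to the planner's rows=700923

# The actual row count was 316029: the columns are correlated, so the
# independence assumption overestimates by roughly a factor of 2.
print(combined_estimate / 316_029)
```

The overestimate is exactly the "correlation the system cannot model"
effect: September rows are disproportionately *not* product_id = 2, but
the planner has no way to know that.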
Hi,

I'm the lead developer on the project this concerns (forgive my
newbiness on this list). We tried a couple of scenarios with
effective_cache_size=60000, cpu_index_tuple_cost=0.0001 and
random_page_cost=2, with no change in the plan.

explain analyse select * from tablename where transaction_date >=
'2003-9-1' and transaction_date < '2003-10-1';

 Seq Scan on tablename  (cost=0.00..348199.14 rows=1180724 width=91)
   (actual time=7727.668..36286.898 rows=579238 loops=1)
   Filter: ((transaction_date >= '2003-09-01 00:00:00+00'::timestamp
     with time zone) AND (transaction_date < '2003-10-01
     00:00:00+00'::timestamp with time zone))
 Total runtime: 36625.351 ms

explain analyse select * from transactions_posted where product_id = 2;

 Seq Scan on transactions_posted  (cost=0.00..319767.95 rows=6785237
   width=91) (actual time=0.091..35596.328 rows=5713877 loops=1)
   Filter: (product_id = 2)
 Total runtime: 38685.373 ms

The product_id alone gives a difference of a million rows from
estimate to actual, vs. the factor of 2 from the transaction_date.

Dan Manley

Tom Lane writes:
> Andrew Sullivan <andrew@libertyrms.info> writes:
>> The statistics on transaction_date and product_id are set to 1000.
>> Everything is all analysed nicely. But I'm getting a poor plan,
>> because of an estimate that the number of rows to be returned is
>> about double how many actually are:

>> explain analyse select * from transactions_posted where
>> transaction_date >= '2003-9-1' and transaction_date < '2003-10-1'
>> and product_id = 2;

> Are the estimates accurate for queries on the two columns
> individually, ie
> ... where transaction_date >= '2003-9-1' and transaction_date < '2003-10-1'
> ... where product_id = 2

> If so, the problem is that there's a correlation between
> transaction_date and product_id, which the system cannot model
> because it has no multi-column statistics.

> However, given that the estimate is only off by about a factor of 2,
> you'd still be getting the wrong plan even if the estimate were
> perfect, because the estimated costs differ by nearly a factor of 3.

> Given the actual runtimes, I'm thinking maybe you want to reduce
> random_page_cost. What are you using for that now?

>			regards, tom lane
Daniel Manley <dmanley@libertyrms.info> writes:
> The product_id alone gives a difference of a million rows from
> estimate to actual, vs. the factor of 2 from the transaction_date.

You should be thinking in terms of ratios, not absolute difference.
The rows estimate for product_id doesn't look too bad to me; the one
for transaction_date is much further off (a factor of 2). Which is
odd, because the system can usually do all right on range estimates if
you've let it run an ANALYZE with enough histogram bins. Could we see
the pg_stats row for transaction_date?

> We tried a couple of scenarios with effective_cache_size=60000,
> cpu_index_tuple_cost=0.0001 and random_page_cost=2 with no change in
> the plan.

Since you need about a factor of 3 change in the cost estimate to get
it to switch plans, changing random_page_cost by a factor of 2 ain't
gonna do it (the other two numbers are second-order adjustments
unlikely to have much effect, I think). Try 1.5, or even less ... of
course, you have to keep an eye on your other queries and make sure
they don't go nuts, but from what I've heard about your hardware setup
a low random_page_cost isn't out of line with the physical realities.

			regards, tom lane
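[Editor's note: the "factor of 3" arithmetic can be sanity-checked with
the costs from the first message. The sketch below deliberately
oversimplifies by scaling the *entire* index scan cost with
random_page_cost; in the real planner only the random-fetch component
scales, so the true break-even setting is lower than this model's
answer -- which squares with how the thread ends.]

```python
# Simplified sketch of the plan flip: NOT the real PostgreSQL cost
# model, just the dominant terms, using cost numbers from this thread.
DEFAULT_RPC = 4.0              # random_page_cost default at the time

seq_cost = 376_630.33          # estimated seq scan cost
idx_cost = 1_088_862.56        # estimated index scan cost at rpc = 4.0

# The estimated costs differ by nearly a factor of 3, as Tom says:
print(round(idx_cost / seq_cost, 2))   # roughly 2.89

# If the index scan cost scaled linearly with random_page_cost, the
# planner would flip to the index plan when idx_cost * (rpc/4.0)
# drops below seq_cost:
break_even_rpc = DEFAULT_RPC * seq_cost / idx_cost
print(round(break_even_rpc, 2))        # roughly 1.38
```

This is an upper bound on the break-even point: since only part of the
index cost is random I/O, an even lower setting is needed in practice.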
On Thu, Nov 13, 2003 at 03:19:08PM -0500, Tom Lane wrote:
> because the system can usually do all right on range estimates if
> you've let it run an ANALYZE with enough histogram bins. Could we
> see the pg_stats row for transaction_date?

Do you want the whole thing? I left out the really verbose bits when I
posted this in the original:

SELECT attname,null_frac,avg_width,n_distinct,correlation FROM pg_stats
where tablename = 'transactions_posted' AND attname in
('transaction_date','product_id');

     attname      | null_frac | avg_width | n_distinct | correlation
------------------+-----------+-----------+------------+-------------
 product_id       |         0 |         4 |          2 |    0.200956
 transaction_date |         0 |         8 |  -0.200791 |    0.289248

> Since you need about a factor of 3 change in the cost estimate to get
> it to switch plans, changing random_page_cost by a factor of 2 ain't
> gonna do it (the other two numbers are second-order adjustments
> unlikely to have much effect, I think). Try 1.5, or even less ... of
> course, you have to keep an eye on your other queries and make sure
> they don't go nuts, but from what I've heard about your hardware
> setup a low random_page_cost isn't out of line with the physical
> realities.

Actually, this one's on an internal box, and I think 1.5 is too low --
it's really just a pair of mirrored SCSI disks on a PCI controller in
this case. That does the trick, though, so maybe I'm just being too
conservative.

A

--
----
Andrew Sullivan                         204-4141 Yonge Street
Afilias Canada                        Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110
On Thu, Nov 13, 2003 at 04:37:03PM -0500, Andrew Sullivan wrote:
> Actually, this one's on an internal box, and I think 1.5 is too low
> -- it's really just a pair of mirrored SCSI disks on a PCI controller
> in this case. That does the trick, though, so maybe I'm just being
> too conservative.

I spoke too soon. I'd left enable_seqscan=off set. It doesn't actually
prefer an index scan until I set random_page_cost to 0.5. I think
that's a little unrealistic ;-)

A

--
----
Andrew Sullivan                         204-4141 Yonge Street
Afilias Canada                        Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110
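[Editor's note: the gotcha here is that enable_seqscan=off does not
remove the seq scan path; the planner just adds a very large penalty
to its cost, so any random_page_cost experiment run with the penalty in
place compares against an artificially inflated baseline. The sketch
below illustrates this; the PENALTY constant and the linear cost
scaling are illustrative assumptions, not the planner's actual
internals.]

```python
# Sketch of why tuning random_page_cost with enable_seqscan=off is
# misleading.  PENALTY stands in for the planner's internal "disable
# cost" constant -- the exact value here is an assumption.
PENALTY = 1e10

SEQ_COST = 376_630.33          # estimated seq scan cost from the thread
IDX_COST = 1_088_862.56        # estimated index scan cost at rpc = 4.0

def chosen_plan(rpc, seqscan_enabled=True):
    # Simplified: index cost assumed to scale linearly with
    # random_page_cost (default 4.0); seq scan cost held fixed.
    seq = SEQ_COST + (0 if seqscan_enabled else PENALTY)
    idx = IDX_COST * (rpc / 4.0)
    return "index" if idx < seq else "seq"

# With seqscan disabled, the index plan "wins" at any sane setting, so
# the experiment says nothing about the real break-even point:
print(chosen_plan(1.5, seqscan_enabled=False))   # index (looked like 1.5 worked)
print(chosen_plan(1.5, seqscan_enabled=True))    # seq   (it actually didn't)
```

Under this toy model the genuine break-even is between 1.0 and 1.5;
needing 0.5 in practice reflects the extra non-random-I/O terms the
model leaves out.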