Discussion: Strange query optimization in 7.3.2
Hello, I've encountered what seems to be a very strange behavior in the query optimizer using PostgreSQL 7.3.2. Using a query that looks like this:

EXPLAIN ANALYZE SELECT * from (trucks tr JOIN terminals t ON (t.terminal = tr.terminal)
    JOIN manifests m ON (tr.trailer = m.trailer))
    JOIN stops s ON (m.manifest = s.manifest)
WHERE ((tr.group_num = 1) AND (t.city_id = 2)
    AND (s.date BETWEEN '1/1/2003' AND '1/31/2003'));

I get a somewhat slow query with a plan that is not terribly optimal. However, when I accidentally made the query look like this:

EXPLAIN ANALYZE SELECT * from (trucks tr JOIN terminals t ON (t.terminal = tr.terminal)
    JOIN manifests m ON (tr.trailer = m.trailer))
    JOIN stops s ON (m.manifest = s.manifest)
WHERE ((s.manifest = m.manifest) AND (tr.trailer = m.trailer)
    AND (t.terminal = tr.terminal) AND (tr.group_num = 1) AND (t.city_id = 2)
    AND (s.date BETWEEN '1/1/2003' AND '1/31/2003'));

the query plan chosen is more than three times as fast. This is especially strange considering it is identical to the first query apart from redundant information for the joins. If I change the JOIN ON to NATURAL JOIN, the plan chosen is the slow one, unless of course the redundant information is included as well. Using non-explicit joins also results in the slow query plan. The redundant information has a drastic effect on the query plan, but makes no change to the results. Does anyone have any idea why this might be?
Here are the results of EXPLAIN ANALYZE for the two queries.

The slow query:

Hash Join  (cost=202.44..7332.06 rows=9711 width=140) (actual time=21.20..6567.87 rows=19775 loops=1)
  Hash Cond: ("outer".manifest = "inner".manifest)
  ->  Seq Scan on stops s  (cost=0.00..6421.15 rows=117416 width=88) (actual time=0.18..4811.78 rows=118606 loops=1)
        Filter: ((date >= '2003-01-01'::date) AND (date <= '2003-01-31'::date))
  ->  Hash  (cost=199.64..199.64 rows=1119 width=52) (actual time=20.54..20.54 rows=0 loops=1)
        ->  Merge Join  (cost=21.86..199.64 rows=1119 width=52) (actual time=14.72..20.14 rows=52 loops=1)
              Merge Cond: ("outer".trailer = "inner".trailer)
              ->  Index Scan using manifests_trailer_idx on manifests m  (cost=0.00..516.82 rows=13526 width=15) (actual time=0.41..4.25 rows=174 loops=1)
              ->  Sort  (cost=21.86..22.01 rows=62 width=37) (actual time=10.86..11.01 rows=52 loops=1)
                    Sort Key: tr.trailer
                    ->  Hash Join  (cost=1.47..20.01 rows=62 width=37) (actual time=1.63..9.49 rows=52 loops=1)
                          Hash Cond: ("outer".terminal = "inner".terminal)
                          ->  Seq Scan on trucks tr  (cost=0.00..15.38 rows=319 width=21) (actual time=0.05..5.91 rows=319 loops=1)
                                Filter: (group_num = 1)
                          ->  Hash  (cost=1.45..1.45 rows=7 width=16) (actual time=0.32..0.32 rows=0 loops=1)
                                ->  Seq Scan on terminals t  (cost=0.00..1.45 rows=7 width=16) (actual time=0.06..0.27 rows=7 loops=1)
                                      Filter: (city_id = 2)
Total runtime: 6635.96 msec

The faster (redundant) query:

Nested Loop  (cost=21.86..6496.24 rows=9711 width=140) (actual time=14.90..2045.87 rows=19775 loops=1)
  ->  Merge Join  (cost=21.86..199.64 rows=1119 width=52) (actual time=14.67..32.01 rows=52 loops=1)
        Merge Cond: ("outer".trailer = "inner".trailer)
        ->  Index Scan using manifests_trailer_idx on manifests m  (cost=0.00..516.82 rows=13526 width=15) (actual time=0.40..4.47 rows=174 loops=1)
        ->  Sort  (cost=21.86..22.01 rows=62 width=37) (actual time=10.80..11.02 rows=52 loops=1)
              Sort Key: tr.trailer
              ->  Hash Join  (cost=1.47..20.01 rows=62 width=37) (actual time=1.49..9.42 rows=52 loops=1)
                    Hash Cond: ("outer".terminal = "inner".terminal)
                    ->  Seq Scan on trucks tr  (cost=0.00..15.38 rows=319 width=21) (actual time=0.04..5.88 rows=319 loops=1)
                          Filter: (group_num = 1)
                    ->  Hash  (cost=1.45..1.45 rows=7 width=16) (actual time=0.31..0.31 rows=0 loops=1)
                          ->  Seq Scan on terminals t  (cost=0.00..1.45 rows=7 width=16) (actual time=0.06..0.26 rows=7 loops=1)
                                Filter: (city_id = 2)
  ->  Index Scan using stops_mainfest_date_idx on stops s  (cost=0.00..5.62 rows=1 width=88) (actual time=0.13..17.13 rows=380 loops=52)
        Index Cond: (("outer".manifest = s.manifest) AND (s.manifest = "outer".manifest) AND (s.date >= '2003-01-01'::date) AND (s.date <= '2003-01-31'::date))
Total runtime: 2123.36 msec

As you can see, the planner estimates are all the same for the operations common to both plans, but a different choice is made for the final (big) join. Thanks in advance for any enlightenment on this strange issue. I'd rather not be using such a kludgy query to optimize things unless I have no other option.

Alec Mitchell
Alec Mitchell <apm13@columbia.edu> writes:
> I've encountered what seems to be a very strange behavior in the query
> optimizer using postgresql 7.3.2.

I think the reason for the change in plan is the same bug discussed at
http://fts.postgresql.org/db/mw/msg.html?mid=1064055

However, you will probably not like the fix, since it eliminates the bogusly small cost estimate for the duplicated index condition, and thereby ensures that your less-favored plan will always be chosen :-(

What would be interesting is to look into why the planner's estimated costs are inaccurate. I think the main cause is the badly-off join estimate for the tr/t join --- notice it's estimating 1119 rows out where only 52 are actually produced. The nestloop's runtime is directly proportional to the number of outer rows, so this leads directly to a factor-of-20 overestimate of the nestloop's cost, discouraging the planner from using it. The bug that's triggered by the duplicate index condition underestimates the cost, thereby negating that error to some extent.

You should look into whether increasing the statistics targets for t.terminal and tr.terminal would improve the accuracy of the join estimate.

			regards, tom lane
On Monday 14 April 2003 01:25 pm, Tom Lane wrote:
> I think the reason for the change in plan is the same bug discussed at
> http://fts.postgresql.org/db/mw/msg.html?mid=1064055
>
> However, you will probably not like the fix, since it eliminates the
> bogusly small cost estimate for the duplicated index condition, and
> thereby ensures that your less-favored plan will always be chosen :-(
>
> What would be interesting is to look into why the planner's estimated
> costs are inaccurate.  I think the main cause is the badly-off join
> estimate for the tr/t join --- notice it's estimating 1119 rows out
> where only 52 are actually produced.  The nestloop's runtime is directly
> proportional to the number of outer rows, so this leads directly to a
> factor-of-20 overestimate of the nestloop's cost, discouraging the
> planner from using it.  The bug that's triggered by the duplicate
> index condition underestimates the cost, thereby negating that error to
> some extent.
>
> You should look into whether increasing the statistics targets for
> t.terminal and tr.terminal would improve the accuracy of the join
> estimate.

That bug does seem to be the cause of the confusing plan change. I really don't understand the bad estimate for the join, though (note that the estimate of 1119 is for the tr/m join, rather than the tr/t join). Here are some details about the tables and joins involved:

The tr/t join produces 52 rows with unique trailers (the primary key on tr) out of the 750 available (the planner estimates 62). These are then joined with the manifests table m, which has 13526 rows. The relationship between tr.trailer and m.trailer is a bit complex. Of the 750 possible trailer values in tr, 607 have a one-to-one mapping to rows in m. The remaining 143 values are each referenced in 1-70 (avg 24) different rows in m. Additionally, there are 9510 rows in m (the vast majority) which have a null value for trailer (perhaps that is the cause of these bad statistics).
This particular query happens to touch only trailers with a one-to-one relationship to rows in m (this is not unlikely, considering that this condition holds for 607 out of the 750 values). Setting the statistics target for m.trailer to 200 or even 750, both of which should be overkill considering that there are only 750 possible values, and then performing a VACUUM ANALYZE, strangely makes no difference to the row estimates. I'm having a lot of trouble tracking down the reason for this bad estimate (all the preceding estimates are quite good). Any further guidance would be greatly appreciated.

Thanks,
Alec Mitchell
Alec Mitchell <apm13@columbia.edu> writes:
> The tr/t join produces 52 rows with unique trailers (the primary key on
> tr) out of the 750 available (the planner estimates 62).  These are then
> joined with the manifests table m, which has 13526 rows.  The
> relationship between tr.trailer and m.trailer is a bit complex.  Of the
> 750 possible trailer values in tr, 607 have a one-to-one mapping to rows
> in m.  The remaining 143 values are each referenced in 1-70 (avg 24)
> different rows in m.  Additionally, there are 9510 rows in m (the vast
> majority) which have a null value for trailer (perhaps that is the cause
> of these bad statistics).

Hmm ... we fixed a bug last fall in which NULLs were twice-excluded from the estimates for range queries, leading to silly results when the proportion of nulls got nontrivial. This isn't a range query, but I wonder if there's a similar bug lurking here ...

Could you see your way to sending me a dump of these tables for testing purposes? You could exclude the columns not used in the FROM/WHERE clauses, if that is needed to satisfy privacy worries.

			regards, tom lane
Alec Mitchell <apm13@columbia.edu> writes:
> On Monday 14 April 2003 04:08 pm, Tom Lane wrote:
>> Could you see your way to sending me a dump of these tables for testing
>> purposes?  You could exclude the columns not used in the FROM/WHERE
>> clauses, if that is needed to satisfy privacy worries.

> I've attached dumps of the relevant tables (not including the stops
> table (s), it is huge, and doesn't come into play until after the bad
> planner estimate).

Indeed, there's a problem here with nulls --- there is a case in eqjoinsel() that didn't account for nulls, and should've. I've applied the attached patch to fix it. This reduces the estimate of the tr/t/m join size from 1119 to 332, which is still large compared to the true size of 52, but it's a lot better. Without the stops table, I can't really tell if this changes the complete plan or not --- could you apply the patch and find out?

BTW, the remaining error seems to be because the tr.group_num = 1 constraint skews the fraction of matching "trailer" values. I fear there's little chance of persuading the system to figure that out in the near future --- it can't be done without cross-column correlation statistics, which we do not keep.

			regards, tom lane

*** src/backend/utils/adt/selfuncs.c.orig	Sat Mar 22 20:49:13 2003
--- src/backend/utils/adt/selfuncs.c	Tue Apr 15 01:13:01 2003
***************
*** 1552,1578 ****
  		{
  			/*
  			 * We do not have MCV lists for both sides.  Estimate the join
! 			 * selectivity as MIN(1/nd1, 1/nd2).  This is plausible if we
! 			 * assume that the values are about equally distributed: a
! 			 * given tuple of rel1 will join to either 0 or N2/nd2 rows of
! 			 * rel2, so total join rows are at most N1*N2/nd2 giving a
! 			 * join selectivity of not more than 1/nd2.  By the same logic
! 			 * it is not more than 1/nd1, so MIN(1/nd1, 1/nd2) is an upper
! 			 * bound.  Using the MIN() means we estimate from the point of
! 			 * view of the relation with smaller nd (since the larger nd
! 			 * is determining the MIN).  It is reasonable to assume that
! 			 * most tuples in this rel will have join partners, so the
! 			 * bound is probably reasonably tight and should be taken
! 			 * as-is.
  			 *
  			 * XXX Can we be smarter if we have an MCV list for just one
  			 * side?  It seems that if we assume equal distribution for the
  			 * other side, we end up with the same answer anyway.
  			 */
  			if (nd1 > nd2)
! 				selec = 1.0 / nd1;
  			else
! 				selec = 1.0 / nd2;
  		}

  		if (have_mcvs1)
--- 1552,1584 ----
  		{
  			/*
  			 * We do not have MCV lists for both sides.  Estimate the join
! 			 * selectivity as MIN(1/nd1,1/nd2)*(1-nullfrac1)*(1-nullfrac2).
! 			 * This is plausible if we assume that the join operator is
! 			 * strict and the non-null values are about equally distributed:
! 			 * a given non-null tuple of rel1 will join to either zero or
! 			 * N2*(1-nullfrac2)/nd2 rows of rel2, so total join rows are at
! 			 * most N1*(1-nullfrac1)*N2*(1-nullfrac2)/nd2 giving a join
! 			 * selectivity of not more than (1-nullfrac1)*(1-nullfrac2)/nd2.
! 			 * By the same logic it is not more than
! 			 * (1-nullfrac1)*(1-nullfrac2)/nd1, so the expression with MIN()
! 			 * is an upper bound.  Using the MIN() means we estimate from the
! 			 * point of view of the relation with smaller nd (since the larger
! 			 * nd is determining the MIN).  It is reasonable to assume that
! 			 * most tuples in this rel will have join partners, so the bound
! 			 * is probably reasonably tight and should be taken as-is.
  			 *
  			 * XXX Can we be smarter if we have an MCV list for just one
  			 * side?  It seems that if we assume equal distribution for the
  			 * other side, we end up with the same answer anyway.
  			 */
+ 			double		nullfrac1 = stats1->stanullfrac;
+ 			double		nullfrac2 = stats2->stanullfrac;
+
+ 			selec = (1.0 - nullfrac1) * (1.0 - nullfrac2);
  			if (nd1 > nd2)
! 				selec /= nd1;
  			else
! 				selec /= nd2;
  		}

  		if (have_mcvs1)
On Monday 14 April 2003 10:39 pm, Tom Lane wrote:
> Indeed, there's a problem here with nulls --- there is a case in
> eqjoinsel() that didn't account for nulls, and should've.  I've applied
> the attached patch to fix it.  This reduces the estimate of the tr/t/m
> join size from 1119 to 332, which is still large compared to the true
> size of 52, but it's a lot better.  Without the stops table, I can't
> really tell if this changes the complete plan or not --- could you apply
> the patch and find out?
>
> BTW, the remaining error seems to be because the tr.group_num = 1
> constraint skews the fraction of matching "trailer" values.  I fear
> there's little chance of persuading the system to figure that out in the
> near future --- it can't be done without cross-column correlation
> statistics, which we do not keep.

I applied the patch to the 7.3.2 sources, but strangely I get a segfault when I run initdb using the patched PostgreSQL. Specifically, I get this error:

creating system views... ok
loading pg_description... /usr/local/pgsql/bin/initdb: line 801:
 14633 Done                    ( cat <<EOF
CREATE TEMP TABLE tmp_pg_description (
    objoid oid, classname name, objsubid int4, description text) WITHOUT OIDS;
COPY tmp_pg_description FROM STDIN;
EOF
cat "$POSTGRES_DESCR"; cat <<EOF
\.
INSERT INTO pg_description SELECT t.objoid, c.oid, t.objsubid, t.description
    FROM tmp_pg_description t, pg_class c WHERE c.relname = t.classname;
EOF
 )
 14634 Segmentation fault      | "$PGPATH"/postgres $PGSQL_OPT template1 >/dev/null
initdb failed.
Removing /var/lib/postgres/data.

This doesn't make much sense to me, as the patch is pretty straightforward. I'm running an up-to-date Debian Woody distribution with a 2.4.21-pre6 kernel. Let me know if there is anything I can do to help figure this out.
Having spent so much time thinking about this query and its odd results, I've realized that there is a query which does not involve the manifests table but returns the same results and always provides more accurate statistics for the planner (the stops table is not normalized and has some redundant data, allowing a few different paths to the same results). It's marginally slower, but should scale better as the stops and manifests tables grow. As a result, I'm not too worried about the particulars of this optimization anymore (though looking at the plan and estimates for my new query, an estimate of 332 rows from m should probably result in the faster nested loop plan being chosen).

Thanks for all your help,
Alec Mitchell
Alec Mitchell <apm13@columbia.edu> writes:
> I applied the patch to the 7.3.2 sources, but strangely I get a segfault
> when I run initdb using the patched PostgreSQL.  Specifically, I get
> this error:

Sigh ... I do know better than to commit changes without having regression-tested 'em. Honest ;-)

Add this patch atop the last:

*** src/backend/utils/adt/selfuncs.c.orig	Tue Apr 15 01:18:12 2003
--- src/backend/utils/adt/selfuncs.c	Wed Apr 16 00:37:58 2003
***************
*** 1610,1617 ****
  			 * side?  It seems that if we assume equal distribution for the
  			 * other side, we end up with the same answer anyway.
  			 */
! 			double		nullfrac1 = stats1->stanullfrac;
! 			double		nullfrac2 = stats2->stanullfrac;

  			selec = (1.0 - nullfrac1) * (1.0 - nullfrac2);
  			if (nd1 > nd2)
--- 1610,1617 ----
  			 * side?  It seems that if we assume equal distribution for the
  			 * other side, we end up with the same answer anyway.
  			 */
! 			double		nullfrac1 = stats1 ? stats1->stanullfrac : 0.0;
! 			double		nullfrac2 = stats2 ? stats2->stanullfrac : 0.0;

  			selec = (1.0 - nullfrac1) * (1.0 - nullfrac2);
  			if (nd1 > nd2)

			regards, tom lane
On Tuesday 15 April 2003 09:40 pm, Tom Lane wrote:
> Sigh ... I do know better than to commit changes without having
> regression-tested 'em.  Honest ;-)
>
> Add this patch atop the last:

It is running once again, thanks. Unfortunately, this doesn't seem to change the query plan (actual times have changed because this is running on a faster computer):

Hash Join  (cost=123.97..7169.64 rows=2890 width=140) (actual time=134.53..3106.29 rows=19775 loops=1)
  Hash Cond: ("outer".manifest = "inner".manifest)
  ->  Seq Scan on stops s  (cost=0.00..6421.15 rows=117680 width=88) (actual time=0.19..2241.00 rows=118606 loops=1)
        Filter: ((date >= '01/01/2003'::date) AND (date <= '01/31/2003'::date))
  ->  Hash  (cost=123.13..123.13 rows=332 width=52) (actual time=134.09..134.09 rows=0 loops=1)
        ->  Merge Join  (cost=16.79..123.13 rows=332 width=52) (actual time=132.26..133.88 rows=52 loops=1)
              Merge Cond: ("outer".trailer = "inner".trailer)
              ->  Index Scan using manifests_trailer_idx on manifests m  (cost=0.00..309.32 rows=13526 width=15) (actual time=51.96..83.88 rows=174 loops=1)
              ->  Sort  (cost=16.79..16.94 rows=62 width=37) (actual time=48.68..48.74 rows=52 loops=1)
                    Sort Key: tr.trailer
                    ->  Hash Join  (cost=1.47..14.94 rows=62 width=37) (actual time=36.33..48.31 rows=52 loops=1)
                          Hash Cond: ("outer".terminal = "inner".terminal)
                          ->  Index Scan using trucks_group_idx on trucks tr  (cost=0.00..10.31 rows=319 width=21) (actual time=29.79..41.05 rows=319 loops=1)
                                Index Cond: (group_num = 1)
                          ->  Hash  (cost=1.45..1.45 rows=7 width=16) (actual time=5.82..5.82 rows=0 loops=1)
                                ->  Seq Scan on terminals t  (cost=0.00..1.45 rows=7 width=16) (actual time=5.70..5.79 rows=7 loops=1)
                                      Filter: (city_id = 2)
Total runtime: 3130.95 msec

It seems odd to me that the Index Scan on manifests predicts that all rows will be returned, but I guess that doesn't really influence the join decision. Also, it turns out the "better" query I was using mistakenly included redundant join conditions again.
The same slow execution path is chosen without the redundancy (silly me). So even a query that gives accurate planner statistics (estimate of 62 rows vs. 52 actual rows at the join) results in a hash join which ignores the existing indices, rather than a nested loop which uses them efficiently. Maybe I should just disable seqscan for this query and reenable it afterward (it will be run within a function, so that should be easy). If you think that this issue merits further consideration, I'd be happy to send you the relevant columns of the stops table. Many thanks for your help so far.

Alec Mitchell
On Thursday 17 April 2003 10:12 am, Alec Mitchell wrote:
> It seems odd to me that the Index Scan on manifests predicts that all
> rows will be returned, but I guess that doesn't really influence the
> join decision.  Also, it turns out the "better" query I was using
> mistakenly included redundant join conditions again.  The same slow
> execution path is chosen without the redundancy (silly me).  So even a
> query that gives accurate planner statistics (estimate of 62 rows vs.
> 52 actual rows at the join) results in a hash join which ignores the
> existing indices, rather than a nested loop which uses them
> efficiently.  Maybe I should just disable seqscan for this query and
> reenable it afterward (it will be run within a function, so that
> should be easy).  If you think that this issue merits further
> consideration, I'd be happy to send you the relevant columns of the
> stops table.  Many thanks for your help so far.

I'll reply to myself to provide some details on why the nested loop isn't chosen unless enable_seqscan is set to off. This query skips the indirect route through the manifests table and joins directly to the stops table on (s.trailer = tr.trailer), like this:

SELECT * from (trucks tr JOIN terminals t ON (t.terminal = tr.terminal))
    JOIN stops s ON (tr.trailer = s.trailer)
WHERE ((tr.group_num = 1) AND (t.city_id = 2)
    AND (s.date BETWEEN '1/1/2003' AND '1/31/2003'));

I had been doing the join on manifests because a join on an integer is faster than a join on a varchar, but manifests will grow much faster than trailers; that, combined with the unsolvably poor statistics for the tr/m join, makes it a poor choice for this query as the tables grow. With enable_seqscan set to off, I get the fastest query plan (at least with the patched version; unpatched, I have to turn off both enable_hashjoin and enable_mergejoin).
It looks like the reason the fast nested loop isn't chosen by default is that the planner estimates a cost of 477.54 per loop for 62 loops, whereas the actual cost is 8.34 per loop for 52 loops. That's an order of magnitude farther off than the planner's other cost estimates. The EXPLAIN ANALYZE output follows:

Nested Loop  (cost=6.02..29697.65 rows=7545 width=125) (actual time=1.43..756.54 rows=19775 loops=1)
  ->  Hash Join  (cost=6.02..19.50 rows=62 width=37) (actual time=1.32..5.22 rows=52 loops=1)
        Hash Cond: ("outer".terminal = "inner".terminal)
        ->  Index Scan using trucks_group_idx on trucks tr  (cost=0.00..10.31 rows=319 width=21) (actual time=0.04..3.12 rows=319 loops=1)
              Index Cond: (group_num = 1)
        ->  Hash  (cost=6.00..6.00 rows=7 width=16) (actual time=0.58..0.58 rows=0 loops=1)
              ->  Index Scan using terminals_pkey on terminals t  (cost=0.00..6.00 rows=7 width=16) (actual time=0.38..0.55 rows=7 loops=1)
                    Filter: (city_id = 2)
  ->  Index Scan using stops_trailer_date_idx on stops s  (cost=0.00..476.91 rows=124 width=88) (actual time=0.05..7.89 rows=380 loops=52)
        Index Cond: (("outer".trailer = s.trailer) AND (s.date >= '01/01/2003'::date) AND (s.date <= '01/31/2003'::date))
Total runtime: 785.12 msec

Thanks Again,
Alec Mitchell
Alec Mitchell <apm13@columbia.edu> writes:
> With enable_seqscan set to off, I get the fastest query plan (at least
> with the patched version; unpatched, I have to turn off both
> enable_hashjoin and enable_mergejoin).  It looks like the reason the
> fast nested loop isn't chosen by default is that the planner estimates
> a cost of 477.54 per loop for 62 loops, whereas the actual cost is
> 8.34 per loop for 52 loops.

This is a known issue --- the planner overestimates the cost of a nestloop with inner indexscan, because it doesn't allow for the fact that the successive inner indexscans aren't independent. (The upper btree layers, at least, are sure to stay in memory over the successive probes of the inner table ... but the costing charges for a from-scratch indexscan each time through.) We've discussed this before, but I don't think anyone's found an appropriate substitute equation.

			regards, tom lane