Analyzing bug 8049
От | Tom Lane |
---|---|
Тема | Analyzing bug 8049 |
Дата | |
Msg-id | 6546.1365701142@sss.pgh.pa.us обсуждение исходный текст |
Ответы |
Re: Analyzing bug 8049
|
Список | pgsql-hackers |
I looked into the problem reported at http://www.postgresql.org/message-id/E1UPa3B-0004K0-PJ@wrigleys.postgresql.org which is that we're deriving a bogus plan for this query: SELECT * FROM ( SELECT (COALESCE(h_n || '/', '') || l_n)::text AS fault FROM (SELECT _bug_header.h_n, _bug_line.l_nFROM _bug_lineLEFT OUTERJOIN _bug_header on (_bug_line.h_n = _bug_header.h_n) ) AS tmp ) AS tmp2 WHERE (lower(fault) = '1') ORDER BY lower(fault); With use of a nestloop forced, we get this plan: Nested Loop Left Join (cost=0.16..513.11 rows=11 width=8) -> Seq Scan on _bug_line (cost=0.00..31.40 rows=2140 width=8) -> Index Only Scan using _bug_header_unique on _bug_header (cost=0.16..0.21 rows=1 width=4) Index Cond:(h_n = _bug_line.h_n) Filter: (lower((COALESCE(((h_n)::text || '/'::text), ''::text) || (_bug_line.l_n)::text))= '1'::text) in which the problem is pretty obvious: the upper WHERE condition has been pushed to the scan of _bug_header, at which level we get the wrong answers from the COALESCE because the reference to _bug_header.h_n doesn't go to NULL when it should do so because of the outer join. Curiously, this doesn't happen without the ORDER BY. (If you're wondering why there's no sort step in this plan, it's because the planner figures out that the equality constraint on lower(fault) renders the ORDER BY a no-op. But the damage has already been done by then.) In HEAD, this happens because join_clause_is_movable_to() thinks it's safe to move the join clause (lower(fault) = '1') to be evaluated inside the nestloop. Now, that function is perfectly aware of this type of hazard, so it's faithfully checking the clause's nullable_relids to make sure it doesn't push a clause from above an outer join into the RHS of the join. However, it's seeing NULL as the value of the clause's nullable_relids. Ooops. And the reason for that is that what it's seeing isn't actually the original WHERE clause, but a reconstructed version that's been regurgitated by the EquivalenceClass machinery. And that machinery is failing to stamp the constructed clause with the right nullable_relids. I thought I'd fixed that type of problem last fall, in commit 72a4231f0c80f213a4fa75356dc3c6b7c7419059, which added code to attempt to compute those relid sets at need. But there was an oversight in that patch: get_eclass_for_sort_expr() always sets nullable_relids to NULL in any new equivalence-class members that it creates to represent sort keys. In this example, it's obviously bogus for the ORDER BY key "lower(fault)" to be considered as having no nullable_relids, because it's above an outer join that can null _bug_header.h_n. This explains why the ORDER BY clause is needed for the bug to manifest --- if the equivalence-class member for "lower(fault)" is created as a result of processing the WHERE clause, then it's all good because the member is created with the right em_nullable_relids. So, okay, we need to fix get_eclass_for_sort_expr() to compute valid nullable_relids for any sort clauses it's processing. This is a bit problematic because the information isn't readily available to it, but we can certainly add some code to compute that. However, reflecting on this example convinces me that there's a bigger problem: the equivalence-class machinery assumes that textually equivalent expressions (more formally, equal() expression trees) mean the same thing. They don't necessarily. In this example, _bug_header.h_n means something different below the outer join than it does above it. We have hacked around that problem in the past via various ad-hoc rules involving not treating equality constraints as true equivalences if they were affected by outer joins. But that approach was always pretty squishy, and I've wished for some more principled way to define the rules. I now think I see one: to consider two expressions equivalent, we ought to consider not only textual equality, but whether they are subject to the effects of the same outer joins, which we could represent as the relevant nullable_relids sets. So basically this would mean changing equivclass.c to consider the identity of an EquivalenceMember as defined by em_expr plus em_nullable_relids, not just em_expr. Having done that I think we could remove all the bizarre code about below_outer_join and so forth, which is both of dubious correctness and an obstacle to complete optimization of queries. This idea needs more fleshing out, but it's seeming awfully attractive right now. The big problem with it is that it's going to be a more invasive patch than I feel terribly comfortable about back-patching. However, I'm not sure there's much choice, because I don't see any narrow fix for 9.2 that would not result in very substantial degradation of its optimization ability. We can't just lobotomize equivalence-class processing. The plan I'm considering is to get this written and committed to HEAD in the next week, so that it can go out in 9.3beta1. After the patch has survived a reasonable amount of beta testing, I'd be more comfortable about back-patching into 9.2. BTW, I am also suspicious that there are related examples that fail pre-9.2, though I don't have one yet. I previously back-patched 72a4231f0c80f213a4fa75356dc3c6b7c7419059 into all the older branches, so I am wondering if we might need to back-patch this rewrite further than 9.2 as well. Comments? regards, tom lane
В списке pgsql-hackers по дате отправления: