Discussion: why is a constraint not 'pushed down' into a subselect when that subselect uses a 'group by'?
Hi all,

I noticed the following. Given two tables, simply articles and their packages:

    article( id int )
    package( id int, article_id int, amount )

When taking the minimum package for articles, given some constraint on the article table:

    select article.id, p_min
    from article,
         (select article_id, min(amount) as p_min
          from package
          group by article_id) as foo
    where article.id = foo.article_id
      and article.id < 50;

the query plan shows:

                                       QUERY PLAN
----------------------------------------------------------------------------------------
 Hash Join  (cost=730.11..808.02 rows=1 width=8)
   Hash Cond: ("outer".article_id = "inner".id)
   ->  Subquery Scan foo  (cost=726.10..781.74 rows=4451 width=8)
         ->  HashAggregate  (cost=726.10..737.23 rows=4451 width=8)
               ->  Seq Scan on package  (cost=0.00..635.40 rows=18140 width=8)
   ->  Hash  (cost=4.01..4.01 rows=1 width=4)
         ->  Index Scan using article_pkey on article  (cost=0.00..4.01 rows=1 width=4)
               Index Cond: (id < 50)

So a seq scan of the complete package table is done, although it seems the planner could have deduced that only the part with 'article_id < 50' would be relevant, since 'article.id < 50' and 'article.id = foo.article_id'.

If I ditch the group by, then this constraint does get pushed into the subselect:

    select article.id, p_min
    from article,
         (select article_id, amount as p_min
          from package) as foo
    where article.id = foo.article_id
      and article.id < 50;

                                    QUERY PLAN
-----------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..14.22 rows=5 width=8)
   ->  Index Scan using article_pkey on article  (cost=0.00..4.01 rows=1 width=4)
         Index Cond: (id < 50)
   ->  Index Scan using package_idx3 on package  (cost=0.00..10.16 rows=4 width=8)
         Index Cond: ("outer".id = package.article_id)

So it seems the math-knowledge is there ;) I'm wondering what I'm overlooking here.
When I take the first query and add the article constraint 'manually', I get:

    select article.id, p_min
    from article,
         (select article_id, min(amount) as p_min
          from package
          where article_id < 50
          group by article_id) as foo
    where article.id = foo.article_id
      and article.id < 50;

                                          QUERY PLAN
----------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..9.65 rows=1 width=8)
   Join Filter: ("outer".id = "inner".article_id)
   ->  Index Scan using article_pkey on article  (cost=0.00..4.01 rows=1 width=4)
         Index Cond: (id < 50)
   ->  Subquery Scan foo  (cost=0.00..5.62 rows=1 width=8)
         ->  GroupAggregate  (cost=0.00..5.61 rows=1 width=8)
               ->  Index Scan using package_idx3 on package  (cost=0.00..5.60 rows=2 width=8)
                     Index Cond: (article_id < 50)

which would have been a nice plan for the first query ;)

Could someone shine a light on this, please?

--
Best,

Frank.
Frank van Vugt <ftm.van.vugt@foxi.nl> writes:
> If I ditch the group by, then this constraint does get pushed into the
> subselect:

You're misinterpreting it. Without the group by, the plan is a candidate for nestloop-with-inner-index-scan; with the group by, there's another step in the way.

Pushing down into subselects does get done; for instance, in CVS tip I can change the last part of your query to "foo.article_id < 50" and get:

 Hash Join  (cost=25.18..53.53 rows=335 width=8)
   Hash Cond: ("outer".id = "inner".article_id)
   ->  Seq Scan on article  (cost=0.00..20.00 rows=1000 width=4)
   ->  Hash  (cost=25.01..25.01 rows=67 width=8)
         ->  Subquery Scan foo  (cost=24.17..25.01 rows=67 width=8)
               ->  HashAggregate  (cost=24.17..24.34 rows=67 width=8)
                     ->  Seq Scan on package  (cost=0.00..22.50 rows=334 width=8)
                           Filter: (article_id < 50)

Obviously this is on toy tables, but the point is that the constraint does get pushed down through the GROUP BY when appropriate.

			regards, tom lane
> Obviously this is on toy tables

The query is simplified, yes, but the data in the tables is real, albeit not that large.

> You're misinterpreting it.

I might very well be ;) But I also get the feeling I didn't explain well enough what I meant...

> Without the group by, the plan is a candidate for
> nestloop-with-inner-index-scan

Yes, I understand that. I only ditched the group by to check whether the constraint on the article table was indeed recognized as a constraint on the package table, based on 'article.id = foo.article_id'. And it is/was.

> with the group by, there's another step in the way.

Yep, but on my system package gets seq-scanned *without* any additional constraint, resulting in a loooooong processing time.

> Pushing down into subselects does get done, for instance in CVS tip
> I can change the last part of your query to "foo.article_id < 50"
> and get ...

This is why I think I wasn't clear enough. In the real thing, the constraint on the article table is built by some external source, and I cannot easily make assumptions to translate it into a constraint on the package table, especially since I expect the planner to be far better at that than me ;)

So, my base query is this:

    select article.id, p_min
    from article,
         (select article_id, min(amount) as p_min
          from package
          group by article_id) as foo
    where article.id = foo.article_id
      and <some constraint on article table>;

Now, when <constraint> = true, this obviously results in seqscans:

 Hash Join  (cost=1106.79..1251.46 rows=4452 width=8)
   Hash Cond: ("outer".article_id = "inner".id)
   ->  Subquery Scan foo  (cost=726.10..781.74 rows=4451 width=8)
         ->  HashAggregate  (cost=726.10..737.23 rows=4451 width=8)
               ->  Seq Scan on package  (cost=0.00..635.40 rows=18140 width=8)
   ->  Hash  (cost=369.35..369.35 rows=4535 width=4)
         ->  Seq Scan on article  (cost=0.00..369.35 rows=4535 width=4)

But when <constraint> = 'article.id < 50', only article is index-scanned:

 Hash Join  (cost=730.11..808.02 rows=1 width=8)
   Hash Cond: ("outer".article_id = "inner".id)
   ->  Subquery Scan foo  (cost=726.10..781.74 rows=4451 width=8)
         ->  HashAggregate  (cost=726.10..737.23 rows=4451 width=8)
               ->  Seq Scan on package  (cost=0.00..635.40 rows=18140 width=8)
   ->  Hash  (cost=4.01..4.01 rows=1 width=4)
         ->  Index Scan using article_pkey on article  (cost=0.00..4.01 rows=1 width=4)
               Index Cond: (id < 50)

which still results in poor performance due to the seqscan on package. Putting the constraint on package does indeed boost performance, but I cannot make that assumption.

So, what I was asking was: when the 'article.id < 50' constraint is added, it follows that 'foo.article_id < 50' is a constraint as well. Why is this constraint not used to avoid the seqscan on package?

> Obviously this is on toy tables, but the point is that the constraint
> does get pushed down through the GROUP BY when appropriate.

I've only seen it being pushed down when it was already written as a constraint on the subselect's output, like in your example. If necessary, I'll throw together a few commands that build some example tables to show what I mean.

--
Best,

Frank.
Frank van Vugt <ftm.van.vugt@foxi.nl> writes:
> When the 'article.id < 50' constraint is added, it follows that
> 'foo.article_id < 50' is a constraint as well. Why is this constraint not
> used to avoid the seqscan on package?

We don't attempt to make every possible inference (and I don't think you'd like it if we did). The current code will draw inferences about transitive equality: for instance, given a = b and b = c it will infer a = c, if all three operators involved are mergejoinable. But given a = b and some arbitrary other constraint on b, it won't consider substituting a into that other constraint. This example doesn't persuade me that it would be worth expending the cycles to do so.

Aside from the sheer cost of planning time, there are semantic pitfalls to consider. In some datatypes there are values that are "equal" according to the = operator but are distinguishable by other operators --- for example, zero and minus zero in IEEE-standard float arithmetic. We'd need a great deal of caution in determining what inferences can be drawn.

			regards, tom lane
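[Tom's signed-zero pitfall is easy to see in any language with IEEE floats. A quick illustration, not from the thread: the two zeros compare equal, yet other operations can tell them apart, which is why substituting one "equal" value for another inside an arbitrary expression is not always safe.]

```python
import math

a, b = 0.0, -0.0

# "Equal" according to the = operator...
assert a == b

# ...but distinguishable by other operators/functions:
assert math.copysign(1.0, a) == 1.0    # sign of +0.0
assert math.copysign(1.0, b) == -1.0   # sign of -0.0
assert str(a) != str(b)                # '0.0' vs '-0.0'
assert math.atan2(0.0, a) == 0.0       # atan2 honors the zero's sign
assert math.atan2(0.0, b) == math.pi

# So from a == b one may NOT conclude f(a) == f(b) for arbitrary f,
# which is the caution Tom describes for planner inferences.
```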
> We don't attempt to make every possible inference (and I don't think
> you'd like it if we did).

I wasn't really asking you to, either ;)) Just trying to achieve a more in-depth understanding of the way things work.

> This example doesn't persuade me that it would be worth expending the
> cycles to do so.

In the real thing I can easily get good processing times by adding an extra join with article inside the group by and simply using the constraint on that as well, so I'm ok with any choice you make on this. I thought this might have been some kind of special case, though, given its occurrence with group by.

> for example, zero and minus zero in IEEE-standard float arithmetic.

Good example, brings back a few memories as well ;)

Thanks for your explanation, Tom!

--
Best,

Frank.
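[A sketch of the workaround Frank describes, repeating the article join and its constraint inside the grouped subselect, checked for semantic equivalence against the original query. This uses SQLite via Python purely for a self-contained demonstration; SQLite's planner differs from PostgreSQL's, so this shows only that the rewrite returns the same rows, not the plan improvement. Table contents are made up.]

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    create table article (id integer primary key);
    create table package (id integer primary key,
                          article_id integer,
                          amount integer);
""")
con.executemany("insert into article values (?)", [(i,) for i in range(1, 101)])
con.executemany("insert into package (article_id, amount) values (?, ?)",
                [(i, a) for i in range(1, 101) for a in (10 * i, 10 * i + 5)])

original = """
    select article.id, p_min
    from article,
         (select article_id, min(amount) as p_min
          from package
          group by article_id) as foo
    where article.id = foo.article_id and article.id < 50
"""
workaround = """
    select article.id, p_min
    from article,
         (select p.article_id, min(p.amount) as p_min
          from package p
          join article a on a.id = p.article_id
          where a.id < 50              -- external constraint duplicated inside
          group by p.article_id) as foo
    where article.id = foo.article_id and article.id < 50
"""
assert (sorted(con.execute(original).fetchall())
        == sorted(con.execute(workaround).fetchall()))
```

In PostgreSQL the duplicated predicate lets the grouped subselect use the index on package.article_id instead of seq-scanning the whole table, at the cost of keeping the two copies of the constraint in sync.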