Discussion: why is a constraint not 'pushed down' into a subselect when this subselect is using a 'group by' ??

Hi all,

I noticed the following. Given two tables, just simply articles and their
packages:

    article(id int)
    package(id int, article_id int, amount int)

When taking the minimum package amount per article, given some constraint on
the article table:

    select
        article.id,
        p_min
    from
        article,
        (select
            article_id,
            min(amount) as p_min
        from
            package
        group by
            article_id
        ) as foo
    where
        article.id = foo.article_id and
        article.id < 50;

The query plan shows

                                       QUERY PLAN
----------------------------------------------------------------------------------------
 Hash Join  (cost=730.11..808.02 rows=1 width=8)
   Hash Cond: ("outer".article_id = "inner".id)
   ->  Subquery Scan foo  (cost=726.10..781.74 rows=4451 width=8)
         ->  HashAggregate  (cost=726.10..737.23 rows=4451 width=8)
               ->  Seq Scan on package  (cost=0.00..635.40 rows=18140 width=8)
   ->  Hash  (cost=4.01..4.01 rows=1 width=4)
         ->  Index Scan using article_pkey on article  (cost=0.00..4.01 rows=1 width=4)
               Index Cond: (id < 50)

So, a seq scan on the complete package table is done, although it seems the
planner could have deduced that only the rows with 'article_id < 50' would be
relevant, since 'article.id < 50' and 'article.id = foo.article_id'.

If I ditch the group by, then this constraint does get pushed into the
subselect:

    select
        article.id,
        p_min
    from
        article,
        (select
            article_id,
            amount as p_min
        from
            package
        ) as foo
    where
        article.id = foo.article_id and
        article.id < 50;


                                    QUERY PLAN
-----------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..14.22 rows=5 width=8)
   ->  Index Scan using article_pkey on article  (cost=0.00..4.01 rows=1 width=4)
         Index Cond: (id < 50)
   ->  Index Scan using package_idx3 on package  (cost=0.00..10.16 rows=4
width=8)
         Index Cond: ("outer".id = package.article_id)


So it seems the math-knowledge is there ;)
I'm wondering about what I'm overlooking here. When I take the first query and
add the article constraint 'manually', I get:

    select
        article.id,
        p_min
    from
        article,
        (select
            article_id,
            min(amount) as p_min
        from
            package
        where
            article_id < 50
        group by
            article_id
        ) as foo
    where
        article.id = foo.article_id and
        article.id < 50;

                                          QUERY PLAN
----------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..9.65 rows=1 width=8)
   Join Filter: ("outer".id = "inner".article_id)
   ->  Index Scan using article_pkey on article  (cost=0.00..4.01 rows=1 width=4)
         Index Cond: (id < 50)
   ->  Subquery Scan foo  (cost=0.00..5.62 rows=1 width=8)
         ->  GroupAggregate  (cost=0.00..5.61 rows=1 width=8)
               ->  Index Scan using package_idx3 on package  (cost=0.00..5.60 rows=2 width=8)
                     Index Cond: (article_id < 50)


which would have been a nice plan for the first query ;)


Could someone shine a light on this, please?



--
Best,




Frank.


Frank van Vugt <ftm.van.vugt@foxi.nl> writes:
> If I ditch the group by, then this constraint does get pushed into the
> subselect:

You're misinterpreting it.  Without the group by, the plan is a
candidate for nestloop-with-inner-index-scan; with the group by,
there's another step in the way.

Pushing down into subselects does get done, for instance in CVS tip
I can change the last part of your query to "foo.article_id < 50"
and get

 Hash Join  (cost=25.18..53.53 rows=335 width=8)
   Hash Cond: ("outer".id = "inner".article_id)
   ->  Seq Scan on article  (cost=0.00..20.00 rows=1000 width=4)
   ->  Hash  (cost=25.01..25.01 rows=67 width=8)
         ->  Subquery Scan foo  (cost=24.17..25.01 rows=67 width=8)
               ->  HashAggregate  (cost=24.17..24.34 rows=67 width=8)
                     ->  Seq Scan on package  (cost=0.00..22.50 rows=334 width=8)
                           Filter: (article_id < 50)

Obviously this is on toy tables, but the point is that the constraint
does get pushed down through the GROUP BY when appropriate.

            regards, tom lane


> Obviously this is on toy tables

The query is simplified, yes. But the data in the tables is real, albeit
they're not that large.

> You're misinterpreting it.

I might very well be ;)
But I also get the feeling I didn't explain to you well enough what I meant...

> Without the group by, the plan is a candidate for
> nestloop-with-inner-index-scan

Yes, I understand that. I only ditched the group by to check whether the
constraint on the article table was indeed recognized as a constraint on the
package table, based on 'article.id = foo.article_id'. And it is/was.

> with the group by, there's another step in the way.

Yep, but on my system, package gets seq-scanned *without* any additional
constraint, resulting in a very long processing time.

> Pushing down into subselects does get done, for instance in CVS tip
> I can change the last part of your query to "foo.article_id < 50"
> and get ...

This is why I think I wasn't clear enough.

In the real thing, the constraint on the article table is built by some
external source and I cannot easily make assumptions to translate these to a
constraint on the package table, especially since I expect the planner to be
far better in that than me ;)

So, my base query is this:

    select
        article.id, p_min
    from
        article,
        (select
            article_id, min(amount) as p_min
        from
            package
        group by
            article_id
        ) as foo
    where
        article.id = foo.article_id and
        <some constraint on article table>;


Now, when <constraint> = true, this obviously results in seqscans:

 Hash Join  (cost=1106.79..1251.46 rows=4452 width=8)
   Hash Cond: ("outer".article_id = "inner".id)
   ->  Subquery Scan foo  (cost=726.10..781.74 rows=4451 width=8)
         ->  HashAggregate  (cost=726.10..737.23 rows=4451 width=8)
               ->  Seq Scan on package  (cost=0.00..635.40 rows=18140 width=8)
   ->  Hash  (cost=369.35..369.35 rows=4535 width=4)
         ->  Seq Scan on article  (cost=0.00..369.35 rows=4535 width=4)

But when <constraint> = 'article.id < 50', only article is indexscanned:

 Hash Join  (cost=730.11..808.02 rows=1 width=8)
   Hash Cond: ("outer".article_id = "inner".id)
   ->  Subquery Scan foo  (cost=726.10..781.74 rows=4451 width=8)
         ->  HashAggregate  (cost=726.10..737.23 rows=4451 width=8)
               ->  Seq Scan on package  (cost=0.00..635.40 rows=18140 width=8)
   ->  Hash  (cost=4.01..4.01 rows=1 width=4)
         ->  Index Scan using article_pkey on article  (cost=0.00..4.01 rows=1 width=4)
               Index Cond: (id < 50)


Which still results in poor performance due to the seqscan on package.

Putting the constraint on package directly does indeed boost performance, but
I cannot make that assumption.

So, what I was asking was:

When the 'article.id < 50' constraint is added, it follows that
'foo.article_id < 50' is a constraint as well. Why is this constraint not
used to avoid the seqscan on package?

> Obviously this is on toy tables, but the point is that the constraint
> does get pushed down through the GROUP BY when appropriate.

I've only seen it being pushed down when it was already expressed as a
constraint on the grouped column, as in your example.

If necessary, I'll throw together a few commands that build some example
tables to show what I mean.
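For reference, a minimal setup along those lines could look like the sketch
below (the data is made up; `package_idx3` is assumed to be an index on
`package.article_id`, matching the index names appearing in the plans above):

```sql
-- Hypothetical setup matching the schema and plans discussed above.
create table article (
    id int primary key               -- provides article_pkey, as in the plans
);

create table package (
    id int,
    article_id int references article(id),
    amount int
);

-- The index the plans call package_idx3 is assumed to be on article_id.
create index package_idx3 on package (article_id);

-- Some sample data, roughly matching the row counts in the plans.
insert into article select generate_series(1, 5000);
insert into package
    select g, (g % 5000) + 1, (g * 37) % 1000
    from generate_series(1, 20000) g;

analyze article;
analyze package;
```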



--
Best,




Frank.


Frank van Vugt <ftm.van.vugt@foxi.nl> writes:
> When the 'article.id < 50' constraint is added, it follows that
> 'foo.article_id < 50' is a constraint as well. Why is this constraint not
> used to avoid the seqscan on package?

We don't attempt to make every possible inference (and I don't think
you'd like it if we did).  The current code will draw inferences about
transitive equality, for instance given a = b and b = c it will infer
a = c, if all three operators involved are mergejoinable.  But given
a = b and some arbitrary other constraint on b, it won't consider
substituting a into that other constraint.  This example doesn't
persuade me that it would be worth expending the cycles to do so.

Aside from the sheer cost of planning time, there are semantic
pitfalls to consider.  In some datatypes there are values that are
"equal" according to the = operator but are distinguishable by other
operators --- for example, zero and minus zero in IEEE-standard
float arithmetic.  We'd need a great deal of caution in determining
what inferences can be drawn.

            regards, tom lane

> We don't attempt to make every possible inference (and I don't think
> you'd like it if we did).

I wasn't really asking you to, either ;))

Just trying to achieve a more in-depth understanding of the way things work.

> This example doesn't
> persuade me that it would be worth expending the cycles to do so.

In the real thing I can easily get good processing times by adding an extra
join with article inside the subselect and applying the constraint on that as
well, so I'm OK with any choice you make on this.
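For completeness, that workaround would look something like this sketch
(using 'article.id < 50' as a stand-in for the externally built constraint;
the point is that the constraint is applied to article inside the subselect,
so the aggregate only sees the matching package rows):

```sql
select
    article.id,
    p_min
from
    article,
    (select
        package.article_id,
        min(package.amount) as p_min
    from
        package,
        article
    where
        package.article_id = article.id and
        article.id < 50            -- <some constraint on article table>
    group by
        package.article_id
    ) as foo
where
    article.id = foo.article_id and
    article.id < 50;               -- the same constraint, repeated outside
```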

I thought this might have been some kind of special case though, given that it
occurs with the use of group by.

> for example, zero and minus zero in IEEE-standard float arithmetic.

Good example, brings back a few memories as well ;)




Thanks for your explanation, Tom!



--
Best,




Frank.