Обсуждение: Subquery in a JOIN not getting restricted?

Поиск
Список
Период
Сортировка

Subquery in a JOIN not getting restricted?

От
Jay Levitt
Дата:
When I run the following query:

select questions.id
from questions
join (
     select u.id as user_id
     from users as u
     left join scores as s
     on s.user_id = u.id
) as subquery
on subquery.user_id = questions.user_id;

the subquery is scanning my entire user table, even though it's restricted
by the outer query.  (My real subquery is much more complicated, of course,
but this is the minimal fail case.)

Is this just not a thing the optimizer can do?  Are there ways to rewrite
this, still as a subquery, that will be smart enough to only produce the one
row of subquery that matches questions.user_id?

Jay Levitt

Re: Subquery in a JOIN not getting restricted?

От
Tom Lane
Дата:
Jay Levitt <jay.levitt@gmail.com> writes:
> When I run the following query:
> select questions.id
> from questions
> join (
>      select u.id as user_id
>      from users as u
>      left join scores as s
>      on s.user_id = u.id
> ) as subquery
> on subquery.user_id = questions.user_id;

> the subquery is scanning my entire user table, even though it's restricted
> by the outer query.  (My real subquery is much more complicated, of course,
> but this is the minimal fail case.)

> Is this just not a thing the optimizer can do?

Every release since 8.2 has been able to reorder joins in a query
written that way.  Probably it just thinks it's cheaper than the
alternatives.

(Unless you've reduced the collapse_limit variables for some reason?)

            regards, tom lane

Re: Subquery in a JOIN not getting restricted?

От
"Kevin Grittner"
Дата:
Jay Levitt <jay.levitt@gmail.com> wrote:
> When I run the following query:
>
> select questions.id
> from questions
> join (
>      select u.id as user_id
>      from users as u
>      left join scores as s
>      on s.user_id = u.id
> ) as subquery
> on subquery.user_id = questions.user_id;
>
> the subquery is scanning my entire user table, even though it's
> restricted by the outer query.  (My real subquery is much more
> complicated, of course, but this is the minimal fail case.)

It's not a fail case -- it's choosing the plan it thinks is cheapest
based on your costing parameters and the statistics gathered by the
latest ANALYZE of the data.

> Is this just not a thing the optimizer can do?  Are there ways to
> rewrite this, still as a subquery, that will be smart enough to
> only produce the one row of subquery that matches
> questions.user_id?

Well, it can certainly produce the plan you seem to want, if it
looks less expensive.  It kinda did with the following script:

create table questions
  (id int not null primary key, user_id int not null);

insert into questions
  select generate_series(1,100), (random()*1000000)::int;

create table users (id int not null primary key);

insert into users select generate_series(1, 1000000);

create table scores
  (id int not null primary key, user_id int not null);

insert into scores select n, n
  from (select generate_series(1,1000000)) x(n);

vacuum freeze analyze;

explain analyze
select questions.id
from questions
join (
     select u.id as user_id
     from users as u
     left join scores as s
     on s.user_id = u.id
) as subquery
on subquery.user_id = questions.user_id;

Here's the plan I got, which scans the questions and then uses the
index to join to the users.  It's throwing the result of that into a
hash table which is then checked from a sequential scan of the
scores table.  If I had made the scores table wider, it might have
gone from the user table to scores on the index.

 Hash Right Join
     (cost=438.23..18614.23 rows=100 width=4)
     (actual time=2.776..161.237 rows=100 loops=1)
   Hash Cond: (s.user_id = u.id)
   ->  Seq Scan on scores s
           (cost=0.00..14425.00 rows=1000000 width=4)
           (actual time=0.025..77.876 rows=1000000 loops=1)
   ->  Hash
           (cost=436.98..436.98 rows=100 width=8)
           (actual time=0.752..0.752 rows=100 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 4kB
         ->  Nested Loop
                 (cost=0.00..436.98 rows=100 width=8)
                 (actual time=0.032..0.675 rows=100 loops=1)
               ->  Seq Scan on questions
                       (cost=0.00..2.00 rows=100 width=8)
                       (actual time=0.010..0.042 rows=100 loops=1)
               ->  Index Only Scan using users_pkey on users u
                       (cost=0.00..4.34 rows=1 width=4)
                       (actual time=0.005..0.005 rows=1 loops=100)
                     Index Cond: (id = questions.user_id)
 Total runtime: 168.585 ms

If you want help figuring out whether it is choosing the fastest
plan, and how to get it do better if it is not, please read this
page and post the relevant information:

http://wiki.postgresql.org/wiki/SlowQueryQuestions

-Kevin

Re: Subquery in a JOIN not getting restricted?

От
"Kevin Grittner"
Дата:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote:

> If I had made the scores table wider, it might have gone from the
> user table to scores on the index.

Bah.  I just forgot to put an index on scores.user_id.  With that
index available it did what you were probably expecting -- seq scan
on questions, nested loop index scan on users, nested loop index
scan on scores.

You weren't running you test with just a few rows in each table and
expecting the same plan to be generated as for tables with a lot of
rows, were you?

-Kevin

Re: Subquery in a JOIN not getting restricted?

От
Jay Levitt
Дата:
Kevin Grittner wrote:
> "Kevin Grittner"<Kevin.Grittner@wicourts.gov>  wrote:
>
>> If I had made the scores table wider, it might have gone from the
>> user table to scores on the index.
>
> Bah.  I just forgot to put an index on scores.user_id.  With that
> index available it did what you were probably expecting -- seq scan
> on questions, nested loop index scan on users, nested loop index
> scan on scores.
>
> You weren't running you test with just a few rows in each table and
> expecting the same plan to be generated as for tables with a lot of
> rows, were you?

No, we're a startup - we only have 2,000 users and 17,000 scores!  We don't
need test databases yet...

But I just realized something I'd completely forgot (or blocked) - scores is
a view.  And views don't have indexes.  The underlying tables are ultimately
indexed by user_id, but I can believe that Postgres doesn't think that's a
cheap way to do it - especially since we're still using stock tuning
settings (I know) so its costs are all screwed up.

And yep!  When I do a CREATE TABLE AS from that view, and add an index on
user_id, it works just as I'd like.  I've been meaning to persist that view
anyway, so that's what I'll do.

Thanks for the push in the right direction..

Jay

Re: Subquery in a JOIN not getting restricted?

От
Jay Levitt
Дата:
Jay Levitt wrote:
> And yep! When I do a CREATE TABLE AS from that view, and add an index on
> user_id, it works just as I'd like.

Or not.  Feel free to kick me back over to pgsql-novice, but I don't get why
the GROUP BY in this subquery forces it to scan the entire users table (seq
scan here, index scan on a larger table) when there's only one row in users
that can match:

create table questions (
   id int not null primary key,
   user_id int not null
);
insert into questions
   select generate_series(1,1100), (random()*2000)::int;

create table users (
   id int not null primary key
);
insert into users select generate_series(1, 2000);

vacuum freeze analyze;

explain analyze
select questions.id
from questions
join (
   select u.id
   from users as u
   group by u.id
) as s
on s.id = questions.user_id
where questions.id = 1;


Hash Join  (cost=42.28..89.80 rows=2 width=4) (actual time=0.857..1.208
rows=1 loops=1)
    Hash Cond: (u.id = questions.user_id)
    ->  HashAggregate  (cost=34.00..54.00 rows=2000 width=4) (actual
time=0.763..1.005 rows=2000 loops=1)
          ->  Seq Scan on users u  (cost=0.00..29.00 rows=2000 width=4)
(actual time=0.003..0.160 rows=2000 loops=1)
    ->  Hash  (cost=8.27..8.27 rows=1 width=8) (actual time=0.015..0.015
rows=1 loops=1)
          Buckets: 1024  Batches: 1  Memory Usage: 1kB
          ->  Index Scan using questions_pkey on questions  (cost=0.00..8.27
rows=1 width=8) (actual time=0.012..0.013 rows=1 loops=1)
                Index Cond: (id = 1)
  Total runtime: 1.262 ms

This is on patched 9.0.5 built earlier today.  The real query has
aggregates, so it really does need GROUP BY.. I think..

Jay

Re: Subquery in a JOIN not getting restricted?

От
"Kevin Grittner"
Дата:
Jay Levitt <jay.levitt@gmail.com> wrote:

> I don't get why the GROUP BY in this subquery forces it to scan
> the entire users table (seq scan here, index scan on a larger
> table) when there's only one row in users that can match:

> explain analyze
> select questions.id
> from questions
> join (
>    select u.id
>    from users as u
>    group by u.id
> ) as s
> on s.id = questions.user_id
> where questions.id = 1;

>   Total runtime: 1.262 ms

Are you sure there's a plan significantly faster than 1.3 ms?

That said, there might be some room for an optimization which pushes
that test into the query with the "group by" clause.  I don't know
if there's a problem with that which I'm missing, the construct was
judged to be too rare to be worth the cost of testing for it, or
it's just that nobody has yet gotten to it.

-Kevin

Re: Subquery in a JOIN not getting restricted?

От
Merlin Moncure
Дата:
On Wed, Nov 9, 2011 at 9:15 AM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
> Jay Levitt <jay.levitt@gmail.com> wrote:
>
>> I don't get why the GROUP BY in this subquery forces it to scan
>> the entire users table (seq scan here, index scan on a larger
>> table) when there's only one row in users that can match:
>
>> explain analyze
>> select questions.id
>> from questions
>> join (
>>    select u.id
>>    from users as u
>>    group by u.id
>> ) as s
>> on s.id = questions.user_id
>> where questions.id = 1;
>
>>   Total runtime: 1.262 ms
>
> Are you sure there's a plan significantly faster than 1.3 ms?

Well, this may not fit the OP's 'real' query, but the inner subquery
is probably better written as a semi-join (WHERE EXISTS).

merlin

Re: Subquery in a JOIN not getting restricted?

От
"Kevin Grittner"
Дата:
Merlin Moncure <mmoncure@gmail.com> wrote:

> Well, this may not fit the OP's 'real' query

Right, if I recall correctly, the OP said it was simplified down as
far as it could be and still have the issue show.

> but the inner subquery is probably better written as a semi-join
> (WHERE EXISTS).

Well, that doesn't mean that we shouldn't try to optimize perfectly
valid alternative spellings.  Having each logically equivalent way
to write a query generate a different plan is far worse than hints
that are more clearly written.  ;-)

-Kevin

Re: Subquery in a JOIN not getting restricted?

От
Jay Levitt
Дата:
Kevin Grittner wrote:
> Jay Levitt<jay.levitt@gmail.com>  wrote:
>
>> I don't get why the GROUP BY in this subquery forces it to scan
>> the entire users table (seq scan here, index scan on a larger
>> table) when there's only one row in users that can match:

> Are you sure there's a plan significantly faster than 1.3 ms?

Yep!  Watch this:

drop schema if exists jaytest cascade;
create schema jaytest;
set search_path to jaytest;

create table questions (
   id int not null primary key,
   user_id int not null
);
insert into questions
   select generate_series(1,1100), (random()*2000000)::int;

create table users (
   id int not null primary key
);
insert into users select generate_series(1, 2000000);

vacuum freeze analyze;

explain analyze
select questions.id
from questions
join (
   select u.id
   from users as u
   group by u.id
) as s
on s.id = questions.user_id
where questions.id = 1;

-----------------------
  Merge Join  (cost=8.28..90833.02 rows=1818 width=4) (actual
time=888.787..888.790 rows=1 loops=1)
    Merge Cond: (u.id = questions.user_id)
    ->  Group  (cost=0.00..65797.47 rows=2000000 width=4) (actual
time=0.017..735.509 rows=1747305 loops=1)
          ->  Index Scan using users_pkey on users u  (cost=0.00..60797.47
rows=2000000 width=4) (actual time=0.015..331.990 rows=1747305 loops=1)
    ->  Materialize  (cost=8.28..8.29 rows=1 width=8) (actual
time=0.013..0.015 rows=1 loops=1)
          ->  Sort  (cost=8.28..8.28 rows=1 width=8) (actual
time=0.012..0.013 rows=1 loops=1)
                Sort Key: questions.user_id
                Sort Method:  quicksort  Memory: 25kB
                ->  Index Scan using questions_pkey on questions
(cost=0.00..8.27 rows=1 width=8) (actual time=0.006..0.006 rows=1 loops=1)
                      Index Cond: (id = 1)
  Total runtime: 888.832 ms
(11 rows)

explain analyze
select questions.id
from questions
join (
   select u.id
   from users as u
) as s
on s.id = questions.user_id
where questions.id = 1;

-----------------------
  Nested Loop  (cost=0.00..16.77 rows=1 width=4) (actual time=0.019..0.021
rows=1 loops=1)
    ->  Index Scan using questions_pkey on questions  (cost=0.00..8.27
rows=1 width=8) (actual time=0.008..0.009 rows=1 loops=1)
          Index Cond: (id = 1)
    ->  Index Scan using users_pkey on users u  (cost=0.00..8.49 rows=1
width=4) (actual time=0.007..0.007 rows=1 loops=1)
          Index Cond: (u.id = questions.user_id)
  Total runtime: 0.045 ms
(6 rows)

> That said, there might be some room for an optimization which pushes
> that test into the query with the "group by" clause.  I don't know
> if there's a problem with that which I'm missing, the construct was
> judged to be too rare to be worth the cost of testing for it, or
> it's just that nobody has yet gotten to it.

Anyone have more insights on whether this is hard to optimize or simply
not-yet-optimized?  And if the latter, where might I start looking? (Not
that you -really- want me to submit a patch; my C has regressed to the "try
an ampersand.  OK, try an asterisk." level...)

Jay

Re: Subquery in a JOIN not getting restricted?

От
Jay Levitt
Дата:
Kevin Grittner wrote:
> Merlin Moncure<mmoncure@gmail.com>  wrote:
>
>> Well, this may not fit the OP's 'real' query
>
> Right, if I recall correctly, the OP said it was simplified down as
> far as it could be and still have the issue show.
>
>> but the inner subquery is probably better written as a semi-join
>> (WHERE EXISTS).

Kevin's right.  The real query involves several SQL and PL/pgsql functions
(all now inlineable), custom aggregates, a union or two and a small coyote.
  I could post it, but that feels like "Please write my code for me".
Still, if you really want to...

Meanwhile, it's good for me to learn how the planner sees my queries and how
I can best state them.  I assume this is me not understanding something
about restrictions across group-by nodes.

If the query was more like

select questions.id
from questions
join (
   select sum(u.id)
   from users as u
   group by u.id
) as s
on s.id = questions.user_id
where questions.id = 1;

would you no longer be surprised that it scanned all user rows?  I.E. is the
"group by" a red herring, which usually wouldn't be present without an
aggregate, and the real problem is that the planner can't restrict aggregates?

This comment in planagg.c may be relevant; I'm not doing min/max, but is it
still true that GROUP BY always looks at all the rows, period?

void
preprocess_minmax_aggregates(PlannerInfo *root, List *tlist)
...
/* We don't handle GROUP BY or windowing, because our current
* implementations of grouping require looking at all the rows anyway,
*/

Re: Subquery in a JOIN not getting restricted?

От
Tom Lane
Дата:
Jay Levitt <jay.levitt@gmail.com> writes:
> If the query was more like

> select questions.id
> from questions
> join (
>    select sum(u.id)
>    from users as u
>    group by u.id
> ) as s
> on s.id = questions.user_id
> where questions.id = 1;

> would you no longer be surprised that it scanned all user rows?

Don't hold your breath waiting for that to change.  To do what you're
wishing for, we'd have to treat the GROUP BY subquery as if it were an
inner indexscan, and push a join condition into it.  That's not even
possible today.  It might be possible after I get done with the
parameterized-path stuff I've been speculating about for a couple of
years now; but I suspect that even if it is possible, we won't do it
for subqueries because of the planner-performance hit we'd take from
repeatedly replanning the same subquery.

I'd suggest rephrasing the query to do the join underneath the GROUP BY.

            regards, tom lane

Re: Subquery in a JOIN not getting restricted?

От
Jay Levitt
Дата:
> Don't hold your breath waiting for that to change.  To do what you're
> wishing for, we'd have to treat the GROUP BY subquery as if it were an
> inner indexscan, and push a join condition into it.  That's not even
> possible today.

Thanks! Knowing "that's not a thing" helps; we'll just have to rephrase the
query.

Jay

Re: Subquery in a JOIN not getting restricted?

От
Gavin Flower
Дата:
On 10/11/11 09:39, Jay Levitt wrote:
> Kevin Grittner wrote:
>> Jay Levitt<jay.levitt@gmail.com>  wrote:
>>
>>> I don't get why the GROUP BY in this subquery forces it to scan
>>> the entire users table (seq scan here, index scan on a larger
>>> table) when there's only one row in users that can match:
>
>> Are you sure there's a plan significantly faster than 1.3 ms?
>
> Yep!  Watch this:
>
> drop schema if exists jaytest cascade;
> create schema jaytest;
> set search_path to jaytest;
>
> create table questions (
>   id int not null primary key,
>   user_id int not null
> );
> insert into questions
>   select generate_series(1,1100), (random()*2000000)::int;
>
> create table users (
>   id int not null primary key
> );
> insert into users select generate_series(1, 2000000);
>
> vacuum freeze analyze;
>
> explain analyze
> select questions.id
> from questions
> join (
>   select u.id
>   from users as u
>   group by u.id
> ) as s
> on s.id = questions.user_id
> where questions.id = 1;
>
> -----------------------
>  Merge Join  (cost=8.28..90833.02 rows=1818 width=4) (actual
> time=888.787..888.790 rows=1 loops=1)
>    Merge Cond: (u.id = questions.user_id)
>    ->  Group  (cost=0.00..65797.47 rows=2000000 width=4) (actual
> time=0.017..735.509 rows=1747305 loops=1)
>          ->  Index Scan using users_pkey on users u
> (cost=0.00..60797.47 rows=2000000 width=4) (actual time=0.015..331.990
> rows=1747305 loops=1)
>    ->  Materialize  (cost=8.28..8.29 rows=1 width=8) (actual
> time=0.013..0.015 rows=1 loops=1)
>          ->  Sort  (cost=8.28..8.28 rows=1 width=8) (actual
> time=0.012..0.013 rows=1 loops=1)
>                Sort Key: questions.user_id
>                Sort Method:  quicksort  Memory: 25kB
>                ->  Index Scan using questions_pkey on questions
> (cost=0.00..8.27 rows=1 width=8) (actual time=0.006..0.006 rows=1
> loops=1)
>                      Index Cond: (id = 1)
>  Total runtime: 888.832 ms
> (11 rows)
>
> explain analyze
> select questions.id
> from questions
> join (
>   select u.id
>   from users as u
> ) as s
> on s.id = questions.user_id
> where questions.id = 1;
>
> -----------------------
>  Nested Loop  (cost=0.00..16.77 rows=1 width=4) (actual
> time=0.019..0.021 rows=1 loops=1)
>    ->  Index Scan using questions_pkey on questions  (cost=0.00..8.27
> rows=1 width=8) (actual time=0.008..0.009 rows=1 loops=1)
>          Index Cond: (id = 1)
>    ->  Index Scan using users_pkey on users u  (cost=0.00..8.49 rows=1
> width=4) (actual time=0.007..0.007 rows=1 loops=1)
>          Index Cond: (u.id = questions.user_id)
>  Total runtime: 0.045 ms
> (6 rows)
>
>> That said, there might be some room for an optimization which pushes
>> that test into the query with the "group by" clause.  I don't know
>> if there's a problem with that which I'm missing, the construct was
>> judged to be too rare to be worth the cost of testing for it, or
>> it's just that nobody has yet gotten to it.
>
> Anyone have more insights on whether this is hard to optimize or
> simply not-yet-optimized?  And if the latter, where might I start
> looking? (Not that you -really- want me to submit a patch; my C has
> regressed to the "try an ampersand.  OK, try an asterisk." level...)
>
> Jay
>
Minor note:

'PRIMARY KEY' gives you a 'NOT NULL'  constraint for free.

Re: Subquery in a JOIN not getting restricted?

От
Jay Levitt
Дата:
Tom Lane wrote:
> Jay Levitt<jay.levitt@gmail.com>  writes:
>> If the query was more like
>
>> select questions.id
>> from questions
>> join (
>>     select sum(u.id)
>>     from users as u
>>     group by u.id
>> ) as s
>> on s.id = questions.user_id
>> where questions.id = 1;
>
>> would you no longer be surprised that it scanned all user rows?
>
> I'd suggest rephrasing the query to do the join underneath the GROUP BY.

Well, my real goal is to have that inner query in a set-returning function
that gives a computed table of other users relative to the current user, and
then be able to JOIN that with other things and ORDER BY it:

select questions.id
from questions
join (select * from relevance(current_user)) as r
on r.id = questions.user_id
where questions.id = 1;

I assume there's no way for that function (in SQL or PL/pgSQL) to reach to
the upper node and say "do that join again here", or force the join order
from down below? I can't imagine how there could be, but never hurts to ask.
  Right now, our workaround is to pass the joined target user as a function
parameter and do the JOIN in the function, but that means we have to put the
function in the select list, else we hit the lack of LATERAL support:

  -- This would need LATERAL

select questions.id
from questions
join (
   select * from relevance(current_user, questions.user_id)) as r
)
on r.id = questions.user_id
where questions.id = 1;

  -- This works but has lots of row-at-a-time overhead

select questions.id, (
   select * from relevance(current_user, questions.user_id)
) as r
from questions
where questions.id = 1;

Again, just checking if there's a solution I'm missing.  I know the
optimizer is only asymptotically approaching optimal!

Jay