Обсуждение: A *short* planner question

Поиск
Список
Период
Сортировка

A *short* planner question

От
"Nick Fankhauser"
Дата:
I know I'm about to become a pest, but I promise, this is a short one!

Before doing the explain below, I specifically did a verbose analyze & noted
that the results seemed in line with what I expected. I'm on v7.1.3 of PGSQL

Here's the query that runs too slow: (It takes about 30 seconds on a 1.2Ghz
Athlon system with 512MB RAM)

monroe=# explain
monroe-# select
monroe-# actor.actor_id,
monroe-# actor.role_class_code,
monroe-# actor.actor_full_name,
monroe-# actor.actor_person_date_of_birth
monroe-# from actor
monroe-# where exists (select 'x'
monroe(#               from actor_case_assignment,case_data
monroe(#               where actor_case_assignment.actor_id = actor.actor_id
monroe(#               and   actor_case_assignment.case_id =
case_data.case_id
monroe(#               and   case_data.case_disp_global_code = 'Open')
monroe-#   and upper(actor_full_name) like 'RAY, J%';

NOTICE:  QUERY PLAN:

Index Scan using actor_upper_full_name on actor  (cost=0.00..1284478.54
rows=2799 width=40)
  SubPlan
    ->  Nested Loop  (cost=0.00..19291.12 rows=42 width=24)
          ->  Seq Scan on case_data  (cost=0.00..5424.69 rows=2871 width=12)
          ->  Index Scan using actor_case_assignment_both on
actor_case_assignment  (cost=0.00..4.82 rows=1 width=12)

The index actor_case_assignment_both is on (case_id, actor_id) in that
order.
There is also an index on actor_case_assigment(actor_id).
There is a unique index on case_data(case_id).
The is an index on upper(actor_full_name).
case_data.case_disp_global_code has only two distinct values ('Open',
'Closed').
actor contains 279855 rows.
case_data contains 168775 rows.
actor_case_assignment contains 753610 rows.

So... why wouldn't the planner do this:

Index scan on actor for upper(actor_full_name) like 'RAY, J%' returns a
small set (rows < 1000)
For each actor_id in the set:
    do an Index Scan using actor_case_assignment.actor_id to return
actor_case_assignment.case_id (rows < 2000)
    For each case_id:
       do an index scan to return case_data.case_disp_global_code
       Scan each code for 'Open' (rows < 100)

Any thoughts on how I might encourage the planner to use such a strategy?

Another thing that doesn't make sense to me, but doesn't hurt the plan, is
that although we have analyze stats on actor, the planner seems to be using
the default of .01 on upper(actor_full_name) and predicts that 2799 rows
will be returned, however, if I get rid of the upper(), it uses the stats
and predicts that 1 row will be returned. Does the use of an index on a
function make the planner stop using the stats?

Thanks,

-Nick






Re: A *short* planner question

От
Tom Lane
Дата:
"Nick Fankhauser" <nickf@ontko.com> writes:
> So... why wouldn't the planner do this:

Offhand I think the planner should have considered that plan; evidently
it thought it was more expensive than this plan.  (Perhaps it was right;
how selective is the actor_case_assignment(actor_id) index?)  You could
check by temporarily dropping the actor_case_assignment_both index and
seeing what plan you get.  You don't even have to really drop it; try

    begin;
    drop index ...;
    explain ...;
    rollback;

Should work, without the pain of rebuilding the index afterwards...

> Another thing that doesn't make sense to me, but doesn't hurt the plan, is
> that although we have analyze stats on actor, the planner seems to be using
> the default of .01 on upper(actor_full_name) and predicts that 2799 rows
> will be returned, however, if I get rid of the upper(), it uses the stats
> and predicts that 1 row will be returned. Does the use of an index on a
> function make the planner stop using the stats?

There are no stats computed for the values of functional indexes, at
present, thus no way for the planner to derive any realistic estimate for
the selectivity of that clause.  I have a private TODO item about that,
but I'm not sure if the public TODO list mentions it.

            regards, tom lane

Re: A *short* planner question

От
"Nick Fankhauser"
Дата:
> You could
> check by temporarily dropping the actor_case_assignment_both index and
> seeing what plan you get.

Here is the result:

Index Scan using actor_upper_full_name on actor  (cost=0.00..1544484.16
rows=3051 width=40)
  SubPlan
    ->  Nested Loop  (cost=0.00..21275.72 rows=42 width=24)
          ->  Index Scan using actor_case_assignment_fk1 on
actor_case_assignment  (cost=0.00..9221.62 rows=2696 width=12)
          ->  Index Scan using case_data_case_id on case_data
(cost=0.00..4.46 rows=1 width=12)

Lightning-fast, but I need that index on both ids for other purposes.

The problem is that I need the index on both foreign keys because I use it
to kick out duplicate entry attempts during my load process. (Duplicate
actors are ok, and duplicate cases are ok, but an actor can only be assigned
to a case once, so the combination must be unique.)

Fortunately, your info on the function index not using stats got me thinking
in a profitable direction:

I'm using the "exists" subquery only because when I used "distinct", I got
even worse performance. I think this was because the planner chose not to
use my index on upper(actor_full_name) when I simply joined all of the
tables. If I reword the query to use "distinct", and eliminate also the
upper() on my constraint, the query really flies.

So I think my solution will be to add a new column called
"upper_actor_full_name" to my "actor" table, and add a bit of code to my
load process that will populate this field with upper(actor_full_name). It's
a bit of a kludge, but should work until the day that you get to adding
stats for function indexes.

Thanks for the help.

-Nick

--------------------------------------------------------------------------
Nick Fankhauser  nickf@ontko.com  Phone 1.765.935.4283  Fax 1.765.962.9788
Ray Ontko & Co.     Software Consulting Services     http://www.ontko.com/


Re: A *short* planner question

От
Tom Lane
Дата:
"Nick Fankhauser" <nickf@ontko.com> writes:
>> You could
>> check by temporarily dropping the actor_case_assignment_both index and
>> seeing what plan you get.

> Here is the result:

> Index Scan using actor_upper_full_name on actor  (cost=0.00..1544484.16
> rows=3051 width=40)
>   SubPlan
>     ->  Nested Loop  (cost=0.00..21275.72 rows=42 width=24)
>           ->  Index Scan using actor_case_assignment_fk1 on
> actor_case_assignment  (cost=0.00..9221.62 rows=2696 width=12)
>           ->  Index Scan using case_data_case_id on case_data
> (cost=0.00..4.46 rows=1 width=12)

> Lightning-fast, but I need that index on both ids for other purposes.

Hmm.  If the outer side of the nestloop were actually hitting 2696 rows
on average, it wouldn't be "lightning fast".  So the planner's failure
to choose this plan is probably due to this overestimate.  You said you
were still on 7.1.*, right?  It'd be interesting to know if 7.2 gets
this right; it has more detailed stats and hopefully would make a better
estimate of the number of matches.

            regards, tom lane

Re: A *short* planner question

От
"Nick Fankhauser"
Дата:
> Hmm.  If the outer side of the nestloop were actually hitting 2696 rows
> on average, it wouldn't be "lightning fast".  So the planner's failure
> to choose this plan is probably due to this overestimate.

This may be a reasonable estimate- as you might have gathered, we're
matching people to court cases. The typical person (hopefully) only has one
case that they're involved with, but an attorney may be in the 1K-10K range
and a judge in 10k-100K. Since they all live in the same table, the estimate
may not be all that far off. In addition, since we're using a "like", typing
in only an initial char would get roughly 1/26 of the rows back- Does the
planner look at the number of leading chars provided when coming up with a
plan?

> still on 7.1.*, right?  It'd be interesting to know if 7.2 gets
> this right; it has more detailed stats and hopefully would make a better
> estimate of the number of matches.

7.2 on our development box is on the middle-priority TODO list. (2-4 weeks.)
I'll provide feedback when we get there.

As always, thanks for the help!

-Nick

--------------------------------------------------------------------------
Nick Fankhauser  nickf@ontko.com  Phone 1.765.935.4283  Fax 1.765.962.9788
Ray Ontko & Co.     Software Consulting Services     http://www.ontko.com/





Re: A *short* planner question

От
Tom Lane
Дата:
"Nick Fankhauser" <nickf@ontko.com> writes:
> ... In addition, since we're using a "like", typing
> in only an initial char would get roughly 1/26 of the rows back- Does the
> planner look at the number of leading chars provided when coming up with a
> plan?

Yeah it does, if it can see a constant pattern for the LIKE or regex match.

            regards, tom lane