Discussion: An Analyze question


An Analyze question

From: "Nick Fankhauser"
Date:
Hello-

Does analyze consider the entire key when creating statistics for a table,
or a substring composed of the leading chars?

Some background:

I've just modified *all* of the keys on my database by prefixing them with
the same source_id that is 5 chars long. This is in preparation for merging
together data from several different sources where I know that the keys are
unique to the source, but not necessarily between sources.

So on every primary & foreign key, I have executed this update:

update table set key = '18105'||key;

Now, a few queries that used to be swift are very slow, and on further
investigation, I found that the planner is making different decisions-
essentially it looks like the statistics now indicate that each key is much
less selective & hence a poor candidate for an index scan.

I did an analyze on the whole database, and then did an analyze specifically
on the tables involved, and checked against an original copy of the database
to make sure the indexes are identical in both.

If the keys are taken in their entirety, nothing has changed- they are just
as selective as ever. However, if only the leading chars are considered, or
if the leading chars have a higher weighting, they would certainly appear
much less selective.

Any thoughts on what happened here?

Thanks-

-Nick

PS: The before & after explains are pasted in below:

Before:

monroe=# explain select * from actor_cases where actor_id = '18105A7313 53';
NOTICE:  QUERY PLAN:

Merge Join  (cost=27748.94..27807.92 rows=145 width=192)
  ->  Sort  (cost=27713.16..27713.16 rows=3410 width=144)
        ->  Nested Loop  (cost=0.00..27372.75 rows=3410 width=144)
              ->  Index Scan using actor_case_assignment_both on actor_case_assignment  (cost=0.00..11766.67 rows=3410 width=24)
              ->  Index Scan using case_data_case_id on case_data  (cost=0.00..4.56 rows=1 width=120)
  ->  Sort  (cost=35.78..35.78 rows=522 width=48)
        ->  Seq Scan on local_case_type  (cost=0.00..12.22 rows=522 width=48)



After:

develop=# explain select * from actor_cases where actor_id = '18105A7313 53';
NOTICE:  QUERY PLAN:

Hash Join  (cost=27801.99..53031.15 rows=306 width=192)
  ->  Hash Join  (cost=27788.47..51957.43 rows=11377 width=144)
        ->  Seq Scan on case_data  (cost=0.00..6932.35 rows=226535 width=120)
        ->  Hash  (cost=27693.03..27693.03 rows=11377 width=24)
              ->  Seq Scan on actor_case_assignment  (cost=0.00..27693.03 rows=11377 width=24)
  ->  Hash  (cost=12.22..12.22 rows=522 width=48)
        ->  Seq Scan on local_case_type  (cost=0.00..12.22 rows=522 width=48)




--------------------------------------------------------------------------
Nick Fankhauser  nickf@ontko.com  Phone 1.765.935.4283  Fax 1.765.962.9788
Ray Ontko & Co.     Software Consulting Services     http://www.ontko.com/


Re: An Analyze question

From: Tom Lane
Date:
"Nick Fankhauser" <nickf@ontko.com> writes:
> So on every primary & foreign key, I have executed this update:
> update table set key = '18105'||key;
> Now, a few queries that used to be swift are very slow,

Could we see the queries?  (No, I do not remember your view definitions.)

Offhand I would think that 7.2 is smart enough to deal with this
situation; at least it is supposed to.  There might be some cases
involving LIKE/regex matching where it would be overly impressed by
the length of the fixed pattern, but for simple equality joins I
don't see why this change would make any difference.
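
(For illustration only: a hypothetical pattern query such as

select * from actor_case_assignment where actor_id like '18105A7313%';

now carries a five-character-longer fixed prefix than its old equivalent, and
the pattern-selectivity code can read that longer prefix as a much tighter
restriction than it really is. A plain actor_id = 'constant' comparison never
goes through that code path.)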

            regards, tom lane

Re: An Analyze question

From: "Nick Fankhauser"
Date:
> Could we see the queries?  (No, I do not remember your view definitions.)

Sure! I'll append the details below. (I was hoping we had correctly guessed
the cause & you wouldn't need details...)


> Offhand I would think that 7.2 is smart enough to deal with this

We're on 7.1.3. We're working to meet some tight deadlines, so the 7.2
upgrade is about 3 weeks off. I'm leery of going to 7.2 right away: we make
heavy use of timestamps, and I've noted that quite a few folks have tripped
over timestamp issues in the upgrade process, so I figure we'll want some
breathing room when we upgrade.

-NF



The details:

actor_cases is a view:

create view actor_cases as
select
      actor_case_assignment.actor_id,
      case_data.case_id,
      case_data.case_public_id,
      case_data.court_id,
      case_data.case_filed_date,
      case_data.case_disposition_date,
      case_data.case_reopen_date,
      case_data.global_case_type_code,
      case_data.local_case_type_code,
      case_data.case_disp_local_code,
      case_data.case_disp_global_code,
      case_data.case_title,
      local_case_type.local_case_type_desc,
      local_case_type.global_case_type_desc
from
      actor_case_assignment,
      case_data,
      local_case_type
where
      actor_case_assignment.case_id = case_data.case_id
  and case_data.court_id = local_case_type.court_id
  and case_data.local_case_type_code = local_case_type.local_case_type_code;


local_case_type is a trivial lookup table with about 500 rows, so it has no
indexes.

case_data has a unique index on case_id.
case_data.court_id only has about 10 distinct values, so there is no index.
There are about 200,000 records in case_data.


actor_case_assignment has indexes on:
(actor_id, case_id) -unique
(actor_id)
There are about 1,000,000 records in actor_case_assignment.
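
In DDL terms that amounts to roughly the sketch below. The index names
actor_case_assignment_both and case_data_case_id are taken from the plans;
the name of the single-column index is just my shorthand, not necessarily
the real one:

create unique index case_data_case_id on case_data (case_id);
create unique index actor_case_assignment_both on actor_case_assignment (actor_id, case_id);
create index actor_case_assignment_actor_id on actor_case_assignment (actor_id);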

Here are the before & after explains again:

Before:

monroe=# explain select * from actor_cases where actor_id = '18105A7313 53';
NOTICE:  QUERY PLAN:

Merge Join  (cost=27748.94..27807.92 rows=145 width=192)
  ->  Sort  (cost=27713.16..27713.16 rows=3410 width=144)
        ->  Nested Loop  (cost=0.00..27372.75 rows=3410 width=144)
              ->  Index Scan using actor_case_assignment_both on actor_case_assignment  (cost=0.00..11766.67 rows=3410 width=24)
              ->  Index Scan using case_data_case_id on case_data  (cost=0.00..4.56 rows=1 width=120)
  ->  Sort  (cost=35.78..35.78 rows=522 width=48)
        ->  Seq Scan on local_case_type  (cost=0.00..12.22 rows=522 width=48)



After:

develop=# explain select * from actor_cases where actor_id = '18105A7313 53';
NOTICE:  QUERY PLAN:

Hash Join  (cost=27801.99..53031.15 rows=306 width=192)
  ->  Hash Join  (cost=27788.47..51957.43 rows=11377 width=144)
        ->  Seq Scan on case_data  (cost=0.00..6932.35 rows=226535 width=120)
        ->  Hash  (cost=27693.03..27693.03 rows=11377 width=24)
              ->  Seq Scan on actor_case_assignment  (cost=0.00..27693.03 rows=11377 width=24)
  ->  Hash  (cost=12.22..12.22 rows=522 width=48)
        ->  Seq Scan on local_case_type  (cost=0.00..12.22 rows=522 width=48)


And the difference between before & after is these update statements:

update case_data set case_id = '18105'||case_id;
update case_data set court_id = '18105'||court_id;
update actor_case_assignment set actor_case_assignment_id = '18105'||actor_case_assignment_id;
update actor_case_assignment set actor_id = '18105'||actor_id;
update actor_case_assignment set case_id = '18105'||case_id;
update local_case_type set court_id = '18105'||court_id;

All of the "id" fields are varchar(50).

There were similar updates on all keys in the DB, but this is everything I
did on the involved tables. A vacuum analyze was run after the updates.


Re: An Analyze question

From: Tom Lane
Date:
"Nick Fankhauser" <nickf@ontko.com> writes:
>> Offhand I would think that 7.2 is smart enough to deal with this

> We're on 7.1.3.

Hmm...  I don't see why it would make any difference in 7.1
either.  You weren't by any chance selecting on the most common
actor_id were you?  Really 7.1 only has two possible selectivity
estimates for column = 'constant' --- either the constant is the
stored most-common-value, or it ain't.  What do you get from

select attname,attdispersion,s.*
from pg_statistic s, pg_attribute a, pg_class c
where starelid = c.oid and attrelid = c.oid and staattnum = attnum
and relname = 'actor_case_assignment';

in each database?

> Here are the before & after explains again:
> Before:
> monroe=# explain select * from actor_cases where actor_id = '18105A7313 53';
> After:
> develop=# explain select * from actor_cases where actor_id = '18105A7313 53';

Are you really comparing apples to apples here?  I'd think that a proper
comparison of plans would be

monroe=# explain select * from actor_cases where actor_id = 'A7313 53';
develop=# explain select * from actor_cases where actor_id = '18105A7313 53';

            regards, tom lane

Re: An Analyze question

From: "Nick Fankhauser"
Date:
> select attname,attdispersion,s.*
> from pg_statistic s, pg_attribute a, pg_class c
> where starelid = c.oid and attrelid = c.oid and staattnum = attnum
> and relname = 'actor_case_assignment';
>
> in each database?
Here are the results:


The "Before" database:

         attname          | attdispersion | starelid | staattnum | staop | stanullfrac | stacommonfrac | stacommonval  |          staloval           |       stahival
--------------------------+---------------+----------+-----------+-------+-------------+---------------+---------------+-----------------------------+-----------------------
 actor_id                 |    0.00761992 |  2591915 |         1 |  1066 |           0 |     0.0335872 | JTW           | 18105A08315 49              | XS
 case_id                  |    1.4583e-05 |  2591915 |         2 |  1066 |           0 |   7.28935e-05 | 501CP00243    | 02145556                    | YW02647802
 assigned_case_role       |      0.100296 |  2591915 |         6 |  1066 | 0.000108355 |      0.250493 | DEFENDANT     | Attorney                    | THIRD PARTY PLAINTIFF
 actor_case_assignment_id |            -1 |  2591915 |        17 |  1066 |           0 |   9.85048e-07 | X693CF00396-S | 18105A100DF00438-S-17369 53 | XTV8605 685E-S
(4 rows)



The "After" database:

         attname          | attdispersion | starelid | staattnum | staop | stanullfrac | stacommonfrac |    stacommonval    |          staloval          |       stahival
--------------------------+---------------+----------+-----------+-------+-------------+---------------+--------------------+----------------------------+-----------------------
 actor_id                 |       0.03247 | 33979335 |         1 |  1066 |           0 |      0.112092 | 18105XS            | 18105A                     | 18105XS
 case_id                  |   1.22202e-05 | 33979335 |         2 |  1066 |           0 |    6.1086e-05 | 18105498CP01613    | 1810502145556              | 18105YW02647802
 assigned_case_role       |     0.0575305 | 33979335 |         6 |  1066 | 0.000108378 |      0.170868 | Attorney           | Attorney                   | THIRD PARTY PLAINTIFF
 actor_case_assignment_id |            -1 | 33979335 |        17 |  1066 |           0 |   9.85259e-07 | 18105X202CM01929-S | 18105A100CF00018-1-7888 53 | 18105XTV8605 685E-S
(4 rows)


> Are you really comparing apples to apples here?

You're correct, but in this case the plans are the same if I use the value
you suggest. The value I'm searching for is not a common one, so presumably,
I'm getting the dispersion for a value that isn't the "most-common-value".

The records in this table generally fall into three categories:

Judges, who have tens of thousands of cases-
Attorneys, who have thousands of cases-
Average Joes, who have one or two cases-

This means that a typical actor_id will have one record, but the average is
probably about 2000.

Thanks!

-Nick


Re: An Analyze question

From: Tom Lane
Date:
"Nick Fankhauser" <nickf@ontko.com> writes:
> The "Before" database:

>  attname  | attdispersion | starelid | staattnum | staop | stanullfrac | stacommonfrac | stacommonval | staloval       | stahival
> ----------+---------------+----------+-----------+-------+-------------+---------------+--------------+----------------+----------
>  actor_id |    0.00761992 |  2591915 |         1 |  1066 |           0 |     0.0335872 | JTW          | 18105A08315 49 | XS

> The "After" database:

>  attname  | attdispersion | starelid | staattnum | staop | stanullfrac | stacommonfrac | stacommonval | staloval | stahival
> ----------+---------------+----------+-----------+-------+-------------+---------------+--------------+----------+----------
>  actor_id |       0.03247 | 33979335 |         1 |  1066 |           0 |      0.112092 | 18105XS      | 18105A   | 18105XS

Hm.  In the "before" case you are showing JTW as the most common
actor_id, with a frequency of 3.36% of the entries (busy judge,
I suppose).

In the "after" case you are showing "18105XS" as the most common
actor_id, with a frequency of 11.2% of the entries.  Where'd that
come from?  Is it correct?

The change in plans occurs because of the increase in stacommonfrac.
7.1 basically uses stacommonfrac as the selectivity estimate if the
constant is the stacommonval, otherwise stacommonfrac/10.  So if
stacommonfrac goes up, so does the estimated number of rows retrieved,
and that's what's changing your plan.
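
To put rough numbers on it, taking the ~1,000,000-row count you quoted for
actor_case_assignment (my assumption here; the stats above don't show row
counts):

select 0.112092 / 10 * 1000000 as estimated_rows;  -- about 11209

which lines up with the rows=11377 estimates on the actor_case_assignment
scans in your "after" plan.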

            regards, tom lane

Re: An Analyze question

From: "Nick Fankhauser"
Date:
> In the "after" case you are showing "18105XS" as the most common
> actor_id, with a frequency of 11.2% of the entries.  Where'd that
> come from?  Is it correct?

I believe this is correct, and the reason I've not been getting poor
performance on the old database is that the stats are not up to date on the
*old* DB. I've been so focused on problems with the updated DB that I didn't
suspect that the database that was performing well had the "bad" stats.


> 7.1 basically uses stacommonfrac as the selectivity estimate if the
> constant is the stacommonval, otherwise stacommonfrac/10.  So if
> stacommonfrac goes up, so does the estimated number of rows retrieved,
> and that's what's changing your plan.

I understand. That piece of info put us on the track of a solution.

In this case, we've got an unusual distribution that looks like this:

Among the actors to which cases may be assigned:

The State gets 10% of the cases
8 Judges get 3.5% of the cases each
50 Attorneys get about 0.1% each
The remaining 388,000 actors get about 0.001% each.

Given this unusual distribution, the planner can't predict well, so we're
thinking that the best way to handle this is to set up a script that runs our
vacuum analyze and then updates stacommonfrac to 0.01 for this particular
field (see the sketch below).
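
A sketch of the update the script would run (assumes superuser access; the
relname/attname lookups are spelled out so the attnum doesn't have to be
hard-coded, but double-check them against the actual catalogs):

update pg_statistic
   set stacommonfrac = 0.01
 where starelid = (select oid from pg_class
                    where relname = 'actor_case_assignment')
   and staattnum = (select attnum from pg_attribute
                     where attrelid = (select oid from pg_class
                                        where relname = 'actor_case_assignment')
                       and attname = 'actor_id');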

This should give us great performance for the vast majority of the possible
queries. It looks like the 8 judges are on the borderline, and we'll get
poor performance if anyone happens to query on the State, but most of our
users should understand that would be a silly query anyway.

Is there another more graceful way to do this?

Tom- Thanks for your patient help with this. We'll be sure to try this
without the planner tweaks when we upgrade to 7.2 & let you know how it
goes.

regards,

-Nick


Re: An Analyze question

From: Tom Lane
Date:
"Nick Fankhauser" <nickf@ontko.com> writes:
> In this case, we've got an unusual distribution that looks like this:

> Among the actors to which cases may be assigned:

> The State gets 10% of the cases
> 8 Judges get 3.5% of the cases each
> 50 Attorneys get about 0.1% each
> The remaining 388,000 actors get about 0.001% each.

> Given this unusual distribution, the planner can't predict well, so we're
> thinking that the best way to handle this is to set up a script to do our
> vacuum analyze, and then update stacommonfrac to be .01 for this particular
> field.

That's probably your best answer for 7.1 --- just force a suitable
fudge-factor into pg_statistic after any VACUUM ANALYZE run.

In 7.2, you could set the statistics target for actor_id to be 60 or 100
or so, and then the system would actually *know* the above distribution,
and moreover would know the names of the judges and the attorneys.
It'll be interesting to see how the plans change depending on whether
you are searching for a judge or not ...
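
Concretely, the 7.2 commands would look about like this (treat the exact
target value as a starting point to experiment with):

alter table actor_case_assignment alter column actor_id set statistics 100;
analyze actor_case_assignment;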

            regards, tom lane

Re: An Analyze question

From: "Ross J. Reedstrom"
Date:
On Mon, Apr 22, 2002 at 04:32:37PM -0400, Tom Lane wrote:
> It'll be interesting to see how the plans change depending on whether
> you are searching for a judge or not ...

A whole new meaning to judge shopping, err, searching?

Ross