Discussion: An Analyze question
Hello-

Does analyze consider the entire key when creating statistics for a table, or a substring composed of the leading chars?

Some background: I've just modified *all* of the keys on my database by prefixing them with the same source_id, which is 5 chars long. This is in preparation for merging together data from several different sources where I know that the keys are unique to the source, but not necessarily between sources. So on every primary & foreign key, I have executed this update:

update table set key = '18105'||key;

Now, a few queries that used to be swift are very slow, and on further investigation I found that the planner is making different decisions- essentially it looks like the statistics now indicate that each key is much less selective & hence a poor candidate for an index scan.

I did an analyze on the whole database, then did an analyze specifically on the tables involved, and checked against an original copy of the database to make sure the indexes are identical in both.

If the keys are taken in their entirety, nothing has changed- they are just as selective as ever. However, if only the leading chars are considered, or if the leading chars have a higher weighting, they would certainly appear much less selective.

Any thoughts on what happened here?
Thanks-
-Nick

PS: The before & after explains are pasted in below:

Before:

monroe=# explain select * from actor_cases where actor_id = '18105A7313 53';
NOTICE:  QUERY PLAN:

Merge Join  (cost=27748.94..27807.92 rows=145 width=192)
  ->  Sort  (cost=27713.16..27713.16 rows=3410 width=144)
        ->  Nested Loop  (cost=0.00..27372.75 rows=3410 width=144)
              ->  Index Scan using actor_case_assignment_both on actor_case_assignment  (cost=0.00..11766.67 rows=3410 width=24)
              ->  Index Scan using case_data_case_id on case_data  (cost=0.00..4.56 rows=1 width=120)
  ->  Sort  (cost=35.78..35.78 rows=522 width=48)
        ->  Seq Scan on local_case_type  (cost=0.00..12.22 rows=522 width=48)

After:

develop=# explain select * from actor_cases where actor_id = '18105A7313 53';
NOTICE:  QUERY PLAN:

Hash Join  (cost=27801.99..53031.15 rows=306 width=192)
  ->  Hash Join  (cost=27788.47..51957.43 rows=11377 width=144)
        ->  Seq Scan on case_data  (cost=0.00..6932.35 rows=226535 width=120)
        ->  Hash  (cost=27693.03..27693.03 rows=11377 width=24)
              ->  Seq Scan on actor_case_assignment  (cost=0.00..27693.03 rows=11377 width=24)
  ->  Hash  (cost=12.22..12.22 rows=522 width=48)
        ->  Seq Scan on local_case_type  (cost=0.00..12.22 rows=522 width=48)

--------------------------------------------------------------------------
Nick Fankhauser  nickf@ontko.com  Phone 1.765.935.4283  Fax 1.765.962.9788
Ray Ontko & Co.  Software Consulting Services  http://www.ontko.com/
"Nick Fankhauser" <nickf@ontko.com> writes:
> So on every primary & foreign key, I have executed this update:
> update table set key = '18105'||key;
> Now, a few queries that used to be swift are very slow,

Could we see the queries?  (No, I do not remember your view definitions.)

Offhand I would think that 7.2 is smart enough to deal with this situation; at least it is supposed to. There might be some cases involving LIKE/regex matching where it would be overly impressed by the length of the fixed pattern, but for simple equality joins I don't see why this change would make any difference.

			regards, tom lane
> Could we see the queries?  (No, I do not remember your view definitions.)

Sure! I'll append the details below. (I was hoping we had correctly guessed the cause & you wouldn't need details...)

> Offhand I would think that 7.2 is smart enough to deal with this

We're on 7.1.3. We're working to meet some tight deadlines, so the 7.2 upgrade is about 3 weeks off. I'm leery of going to 7.2 right away because we make heavy use of timestamps & I've noted that quite a few folks have tripped over timestamp issues in the upgrade process, so I figure we'll want some breathing room when we upgrade.

-NF

The details:

actor_cases is a view:

create view actor_cases as
select
  actor_case_assignment.actor_id,
  case_data.case_id,
  case_data.case_public_id,
  case_data.court_id,
  case_data.case_filed_date,
  case_data.case_disposition_date,
  case_data.case_reopen_date,
  case_data.global_case_type_code,
  case_data.local_case_type_code,
  case_data.case_disp_local_code,
  case_data.case_disp_global_code,
  case_data.case_title,
  local_case_type.local_case_type_desc,
  local_case_type.global_case_type_desc
from actor_case_assignment, case_data, local_case_type
where actor_case_assignment.case_id = case_data.case_id
  and case_data.court_id = local_case_type.court_id
  and case_data.local_case_type_code = local_case_type.local_case_type_code;

local_case_type is a trivial lookup table with about 500 rows, so it has no indexes.

case_data has a unique index on case_id. case_data.court_id only has about 10 distinct values, so there is no index. There are about 200,000 records in case_data.

actor_case_assignment has indexes on:
  (actor_id, case_id)  -unique
  (actor_id)
There are about 1,000,000 records in actor_case_assignment.
Here are the before & after explains again:

Before:

monroe=# explain select * from actor_cases where actor_id = '18105A7313 53';
NOTICE:  QUERY PLAN:

Merge Join  (cost=27748.94..27807.92 rows=145 width=192)
  ->  Sort  (cost=27713.16..27713.16 rows=3410 width=144)
        ->  Nested Loop  (cost=0.00..27372.75 rows=3410 width=144)
              ->  Index Scan using actor_case_assignment_both on actor_case_assignment  (cost=0.00..11766.67 rows=3410 width=24)
              ->  Index Scan using case_data_case_id on case_data  (cost=0.00..4.56 rows=1 width=120)
  ->  Sort  (cost=35.78..35.78 rows=522 width=48)
        ->  Seq Scan on local_case_type  (cost=0.00..12.22 rows=522 width=48)

After:

develop=# explain select * from actor_cases where actor_id = '18105A7313 53';
NOTICE:  QUERY PLAN:

Hash Join  (cost=27801.99..53031.15 rows=306 width=192)
  ->  Hash Join  (cost=27788.47..51957.43 rows=11377 width=144)
        ->  Seq Scan on case_data  (cost=0.00..6932.35 rows=226535 width=120)
        ->  Hash  (cost=27693.03..27693.03 rows=11377 width=24)
              ->  Seq Scan on actor_case_assignment  (cost=0.00..27693.03 rows=11377 width=24)
  ->  Hash  (cost=12.22..12.22 rows=522 width=48)
        ->  Seq Scan on local_case_type  (cost=0.00..12.22 rows=522 width=48)

And the difference between before & after is these update statements:

update case_data set case_id = '18105'||case_id;
update case_data set court_id = '18105'||court_id;
update actor_case_assignment set actor_case_assignment_id = '18105'||actor_case_assignment_id;
update actor_case_assignment set actor_id = '18105'||actor_id;
update actor_case_assignment set case_id = '18105'||case_id;
update local_case_type set court_id = '18105'||court_id;

All of the "id" fields are varchar(50). There were similar updates on all keys in the DB, but this is everything I did on the involved tables. A vacuum analyze was run after the updates.
"Nick Fankhauser" <nickf@ontko.com> writes:
>> Offhand I would think that 7.2 is smart enough to deal with this
> We're on 7.1.3.

Hmm... I don't see why it would make any difference in 7.1 either. You weren't by any chance selecting on the most common actor_id, were you? Really 7.1 only has two possible selectivity estimates for column = 'constant' --- either the constant is the stored most-common-value, or it ain't.

What do you get from

select attname, attdispersion, s.*
from pg_statistic s, pg_attribute a, pg_class c
where starelid = c.oid and attrelid = c.oid and staattnum = attnum
  and relname = 'actor_case_assignment';

in each database?

> Here are the before & after explains again:
> Before:
> monroe=# explain select * from actor_cases where actor_id = '18105A7313 53';
> After:
> develop=# explain select * from actor_cases where actor_id = '18105A7313 53';

Are you really comparing apples to apples here? I'd think that a proper comparison of plans would be

monroe=# explain select * from actor_cases where actor_id = 'A7313 53';
develop=# explain select * from actor_cases where actor_id = '18105A7313 53';

			regards, tom lane
> select attname, attdispersion, s.*
> from pg_statistic s, pg_attribute a, pg_class c
> where starelid = c.oid and attrelid = c.oid and staattnum = attnum
>   and relname = 'actor_case_assignment';
>
> in each database?

Here are the results. The columns are:

attname | attdispersion | starelid | staattnum | staop | stanullfrac | stacommonfrac | stacommonval | staloval | stahival

The "Before" database:

actor_id                 | 0.00761992  | 2591915 |  1 | 1066 | 0           | 0.0335872   | JTW           | 18105A08315 49              | XS
case_id                  | 1.4583e-05  | 2591915 |  2 | 1066 | 0           | 7.28935e-05 | 501CP00243    | 02145556                    | YW02647802
assigned_case_role       | 0.100296    | 2591915 |  6 | 1066 | 0.000108355 | 0.250493    | DEFENDANT     | Attorney                    | THIRD PARTY PLAINTIFF
actor_case_assignment_id | -1          | 2591915 | 17 | 1066 | 0           | 9.85048e-07 | X693CF00396-S | 18105A100DF00438-S-17369 53 | XTV8605 685E-S
(4 rows)

The "After" database:

actor_id                 | 0.03247     | 33979335 |  1 | 1066 | 0           | 0.112092    | 18105XS            | 18105A                     | 18105XS
case_id                  | 1.22202e-05 | 33979335 |  2 | 1066 | 0           | 6.1086e-05  | 18105498CP01613    | 1810502145556              | 18105YW02647802
assigned_case_role       | 0.0575305   | 33979335 |  6 | 1066 | 0.000108378 | 0.170868    | Attorney           | Attorney                   | THIRD PARTY PLAINTIFF
actor_case_assignment_id | -1          | 33979335 | 17 | 1066 | 0           | 9.85259e-07 | 18105X202CM01929-S | 18105A100CF00018-1-7888 53 | 18105XTV8605 685E-S
(4 rows)

> Are you really comparing apples to apples here?

You're correct, but in this case the plans are the same if I use the value you suggest.
The value I'm searching for is not a common one, so presumably I'm getting the dispersion for a value that isn't the "most-common-value".

The records in this table generally fall into three categories:

- Judges, who have tens of thousands of cases
- Attorneys, who have thousands of cases
- Average Joes, who have one or two cases

This means that a typical actor_id will have one record, but the average is probably about 2000.

Thanks!
-Nick
"Nick Fankhauser" <nickf@ontko.com> writes:
> The "Before" database:
> actor_id | 0.00761992 | 2591915 | 1 | 1066 | 0 | 0.0335872 | JTW | 18105A08315 49 | XS

> The "After" database:
> actor_id | 0.03247 | 33979335 | 1 | 1066 | 0 | 0.112092 | 18105XS | 18105A | 18105XS

Hm. In the "before" case you are showing JTW as the most common actor_id, with a frequency of 3.36% of the entries (busy judge, I suppose). In the "after" case you are showing "18105XS" as the most common actor_id, with a frequency of 11.2% of the entries. Where'd that come from? Is it correct?

The change in plans occurs because of the increase in stacommonfrac. 7.1 basically uses stacommonfrac as the selectivity estimate if the constant is the stacommonval, otherwise stacommonfrac/10. So if stacommonfrac goes up, so does the estimated number of rows retrieved, and that's what's changing your plan.

			regards, tom lane
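The rule Tom describes can be sketched numerically. The stacommonfrac values below come from the pg_statistic output earlier in the thread; the table size of 1,015,000 rows is an assumption (Nick said "about 1,000,000"), and the helper names are illustrative, not PostgreSQL internals.

```python
# Sketch of the 7.1-era selectivity rule for column = 'constant':
#   selectivity = stacommonfrac        if the constant is the stacommonval
#   selectivity = stacommonfrac / 10   otherwise
# (function names are my own, not PostgreSQL's)

def eqsel_71(stacommonfrac: float, is_most_common_value: bool) -> float:
    """Selectivity estimate for an equality clause in 7.1."""
    return stacommonfrac if is_most_common_value else stacommonfrac / 10.0

def estimated_rows(stacommonfrac: float, is_mcv: bool, table_rows: int) -> int:
    return round(eqsel_71(stacommonfrac, is_mcv) * table_rows)

TABLE_ROWS = 1_015_000  # assumed size of actor_case_assignment (~1M rows)

# "Before" stats: stacommonfrac = 0.0335872; search value is not the MCV
before = estimated_rows(0.0335872, False, TABLE_ROWS)

# "After" stats: stacommonfrac = 0.112092; search value is still not the MCV
after = estimated_rows(0.112092, False, TABLE_ROWS)

print(before, after)  # in the same ballpark as rows=3410 / rows=11377 in the plans
```

With this assumed row count the two estimates line up closely with the row counts in the before and after EXPLAIN output, which is consistent with Tom's diagnosis that the tripled stacommonfrac alone drove the plan change.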
> In the "after" case you are showing "18105XS" as the most common
> actor_id, with a frequency of 11.2% of the entries. Where'd that
> come from? Is it correct?

I believe this is correct, and the reason I've not been getting poor performance on the old database is that the stats are not up to date on the *old* DB. I've been so focused on problems with the updated DB that I didn't suspect that the database that was performing well had the "bad" stats.

> 7.1 basically uses stacommonfrac as the selectivity estimate if the
> constant is the stacommonval, otherwise stacommonfrac/10. So if
> stacommonfrac goes up, so does the estimated number of rows retrieved,
> and that's what's changing your plan.

I understand. That piece of info put us on the track of a solution. In this case, we've got an unusual distribution that looks like this:

Among the actors to which cases may be assigned:
- The State gets 10% of the cases
- 8 Judges get 3.5% of the cases each
- 50 Attorneys get about 0.1% each
- The remaining 388,000 actors get about 0.001% each

Given this unusual distribution, the planner can't predict well, so we're thinking that the best way to handle this is to set up a script to do our vacuum analyze, and then update stacommonfrac to be .01 for this particular field. This should give us great performance for the vast majority of the possible queries. It looks like the 8 judges are on the borderline, and we'll get poor performance if anyone happens to query on the State, but most of our users should understand that would be a silly query anyway.

Is there another more graceful way to do this?

Tom- Thanks for your patient help with this. We'll be sure to try this without the planner tweaks when we upgrade to 7.2 & let you know how it goes.

regards,
-Nick
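A rough sanity check of the proposed fudge factor, applying the 7.1 non-MCV rule (stacommonfrac/10) to the distribution described above. The 1,000,000-row table size is an assumption from the stated figures, and the per-class percentages are taken directly from the message.

```python
# Sanity check of forcing stacommonfrac = 0.01 in 7.1:
# a non-MCV equality clause then gets selectivity 0.01 / 10 = 0.001.

TABLE_ROWS = 1_000_000        # assumed size of actor_case_assignment
FORCED_STACOMMONFRAC = 0.01

# Planner estimate for any non-MCV actor_id under the 7.1 rule
estimate = (FORCED_STACOMMONFRAC / 10) * TABLE_ROWS   # about 1,000 rows

# Actual row counts per actor class, from the stated distribution
actual = {
    "the State":   0.10 * TABLE_ROWS,     # 100,000 rows: badly underestimated
    "a judge":     0.035 * TABLE_ROWS,    #  35,000 rows: the borderline cases
    "an attorney": 0.001 * TABLE_ROWS,    #   1,000 rows: estimate is spot on
    "average Joe": 0.00001 * TABLE_ROWS,  #     ~10 rows: overestimated, but
}                                         #     index scan is still chosen

for who, rows in actual.items():
    print(f"{who}: actual {rows:,.0f} rows vs. estimate {estimate:,.0f} rows")
```

This matches the intuition in the message: the estimate is accurate for attorneys, conservative enough to keep index scans for the vast majority of actors, borderline for the 8 judges, and badly wrong only for queries on the State.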
"Nick Fankhauser" <nickf@ontko.com> writes:
> In this case, we've got an unusual distribution that looks like this:
> Among the actors to which cases may be assigned:
> - The State gets 10% of the cases
> - 8 Judges get 3.5% of the cases each
> - 50 Attorneys get about 0.1% each
> - The remaining 388,000 actors get about 0.001% each
> Given this unusual distribution, the planner can't predict well, so we're
> thinking that the best way to handle this is to set up a script to do our
> vacuum analyze, and then update stacommonfrac to be .01 for this particular
> field.

That's probably your best answer for 7.1 --- just force a suitable fudge-factor into pg_statistic after any VACUUM ANALYZE run.

In 7.2, you could set the statistics target for actor_id to be 60 or 100 or so, and then the system would actually *know* the above distribution, and moreover would know the names of the judges and the attorneys. It'll be interesting to see how the plans change depending on whether you are searching for a judge or not ...

			regards, tom lane
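A simplified model of why a larger most-common-values list helps, per Tom's suggestion. This is not the actual 7.2 estimator, just the basic idea: with the 59 heavy hitters stored as MCVs, everyone else is estimated from the leftover frequency spread over the remaining distinct values. The distinct-value count and the placeholder actor names are assumptions for illustration.

```python
# Simplified model of a 7.2-style MCV-based estimate (not the real code):
# MCV hits use their stored frequency; non-MCV values share what's left.

TABLE_ROWS = 1_000_000
N_DISTINCT = 388_059  # 1 State + 8 judges + 50 attorneys + 388,000 others (assumed)

# MCV list covering the 59 heavy hitters, per the stated distribution
# (actor names here are placeholders, not real keys)
mcvs = {"STATE": 0.10}
mcvs.update({f"judge_{i}": 0.035 for i in range(8)})
mcvs.update({f"attorney_{i}": 0.001 for i in range(50)})

def estimated_rows(actor_id: str) -> float:
    if actor_id in mcvs:
        return mcvs[actor_id] * TABLE_ROWS            # known frequency
    leftover = 1.0 - sum(mcvs.values())               # ~0.57 of the table
    return leftover / (N_DISTINCT - len(mcvs)) * TABLE_ROWS

print(round(estimated_rows("judge_0")))      # 35000: big enough to justify a seq scan
print(round(estimated_rows("some_joe"), 1))  # ~1.5: index scan for ordinary actors
```

This is why the plans would differ depending on whether you search for a judge: a judge lands in the MCV list with its true (large) frequency, while an ordinary actor gets a tiny, realistic estimate instead of one inflated fudge factor for everybody.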
On Mon, Apr 22, 2002 at 04:32:37PM -0400, Tom Lane wrote: > It'll be interesting to see how the plans change depending on whether > you are searching for a judge or not ... A whole new meaning to judge shopping, err, searching? Ross