Call for alpha testing: planner statistics revisions
From: Tom Lane
Subject: Call for alpha testing: planner statistics revisions
Msg-id: 11872.992810825@sss.pgh.pa.us
Replies: Re: Call for alpha testing: planner statistics revisions
List: pgsql-general
I have finished a first pass at the planner statistics and cost estimation changes that I want to do for 7.2. It's now time to see how well the new code does in the real world. Accordingly, if you've had problems in the past with silly choices of plans, I'd like to ask you to load your data into a test installation of current sources and see if the planner is any brighter than before. I realize that loading a bunch of data into a temporary installation is a pain in the neck, but it'd be really great to get some feedback about the performance of the new code now, while we're still early enough in the 7.2 development cycle to do something about any problems that turn up.

If you're willing to help out, you can get current sources from the CVS server, or from the nightly snapshot tarball (see the dev/ directory on your favorite Postgres FTP mirror).

Some highlights of the new code:

* ANALYZE is now available as a separate command; you can run it without also doing a VACUUM. (Of course, VACUUM ANALYZE still works.)

* On large tables, ANALYZE examines a random sample of rows rather than every row, so it should take a reasonably short time even on very large tables. The possible downside is inaccurate stats; we need to find out whether the sample size is large enough.

* Statistics now include the "top ten" most common values, not just the single most common value, plus an estimate of the total number of distinct values in a column. This should make selectivity estimates for "column = something" clauses a lot better than before, especially for highly skewed data distributions.

* Statistics also include (for scalar datatypes) a histogram that gives the boundary values dividing the data into ten roughly-equal-population bins. This should allow much better estimation for inequality and range queries, again especially for skewed data distributions.
(Note that "range queries" include such things as anchored LIKE and regexp searches, plus now inet subnet searches thanks to Alex Pilosov.)

* The magic number "ten" mentioned above is controllable via ALTER TABLE tab ALTER COLUMN col SET STATISTICS statstarget. Adjusting it gives a tradeoff between estimation accuracy and the time/space taken by ANALYZE. We need to find out if ten is a good default or not ... it might be too high or too low.

* There's also a physical-order-correlation statistic that should help the planner deal with clustered indices better. Whether it's good enough, and whether the costs are correctly interpolated using it, remain to be seen.

For more info see my original proposal at http://fts.postgresql.org/db/mw/msg.html?mid=112714 and followup discussion.

regards, tom lane
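[A minimal sketch of how a tester might exercise the knobs described above from psql. The table name `orders` and column name `status` are hypothetical stand-ins for your own skewed data; the commands themselves (ANALYZE, ALTER TABLE ... SET STATISTICS, the pg_stats view) are the ones described in this post.]

```sql
-- Collect statistics without a full VACUUM (newly possible in this code):
ANALYZE orders;

-- Raise the per-column statistics target, then re-analyze; a larger
-- target keeps more "most common" values and finer histogram bins,
-- at the cost of a slower ANALYZE and bigger stored statistics:
ALTER TABLE orders ALTER COLUMN status SET STATISTICS 50;
ANALYZE orders;

-- Inspect what the planner now knows about the column:
SELECT null_frac, n_distinct, most_common_vals,
       most_common_freqs, histogram_bounds, correlation
FROM pg_stats
WHERE tablename = 'orders' AND attname = 'status';

-- See whether the new estimates change the chosen plan:
EXPLAIN SELECT * FROM orders WHERE status = 'shipped';
```

Running EXPLAIN on a problem query before and after ANALYZE is the quickest way to see whether the new estimates actually change the plan, which is exactly the feedback being asked for here.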