Re: Does "correlation" mislead the optimizer on large
From: Ron Mayer
Subject: Re: Does "correlation" mislead the optimizer on large
Date:
Msg-id: Pine.LNX.4.44.0301241138070.986-100000@localhost.localdomain
In reply to: Re: Does "correlation" mislead the optimizer on large (Stephan Szabo <sszabo@megazone23.bigpanda.com>)
Responses: Re: Does "correlation" mislead the optimizer on large
List: pgsql-performance
On Fri, 24 Jan 2003, Stephan Szabo wrote:
> I think it's a clumping effect.

Yup, I think that's exactly the effect.

A proposal... (yes, I'm volunteering if people point me in the right direction)... would be to have a "pluggable" set of analyze functions, so that a huge database that runs analyze infrequently could choose a very slow analyze that might work better for its data. I see no reason the different analyze functions would need to be compiled into the source code; they could probably exist as PL/pgSQL functions. The one thing compiling them in would help with is knowing the exact number of tuples on each individual page, but I guess reltuples/relpages from pg_class is a good estimate (I've sketched a query for that at the end of this message).

> For example, I made a table (ordered) with 20 values of a, 50 values of b
> (each showing up in each a) and 100 values of c (not used, just means 100
> rows for each (a,b) combination). It's got 541 pages it looks like. Analyze
> sets the correlation to about 0.08 on the table and so a query like:
> select * from test1 where b=1; prefers a sequence scan (1791 vs 2231)
> while the index scan actually performs about 5 times better.

That sounds like the same situation I was in. If my logic is right, this means you had about 184 tuples/page (20*50*100/541), so it looks to me like for each "a" you get half a page where "b=1". If 'c' had 200 values, I think you'd see an even bigger speedup, because right now half of each such page is still "wasted" on b=2 values. If 'c' had 10000 values, I think you'd see a slightly bigger speedup still, because you'd have so many b=1 pages next to each other that you'd benefit from more sequential disk access.

> I guess the reason is that in general, the index scan *really* is reading
> something on the order of 40 pages rather than the much larger estimate
> (I'd guess something on the order of say 300-400? I'm not sure how to
> find that except by trying to reverse engineer the estimate number),

Or by adding a printf()... I think it'd be in cost_index() in costsize.c.

> because pretty much each value of a will probably have 1 or 2 pages with
> b=1.
>
> I'm not really sure how to measure that, however.

As I said, I'm happy to volunteer and experiment if people point me in a good direction.

Ron
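P.S. In case it helps anyone reproduce the effect, here's a rough sketch of the setup Stephan described. This is my guess at it, not his actual script: the index name is made up, and generate_series() assumes a reasonably recent PostgreSQL.

    -- 20 values of a, 50 of b, 100 rows per (a,b) pair, loaded in order
    -- so that the rows for each (a,b) pair are physically clumped together.
    CREATE TABLE test1 (a int, b int, c int);

    INSERT INTO test1
    SELECT a, b, c
    FROM generate_series(1, 20)  AS a,
         generate_series(1, 50)  AS b,
         generate_series(1, 100) AS c
    ORDER BY a, b, c;

    CREATE INDEX test1_b_idx ON test1 (b);   -- hypothetical index name
    ANALYZE test1;

    -- See which plan the optimizer picks for the query in question:
    EXPLAIN SELECT * FROM test1 WHERE b = 1;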
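And the tuples-per-page estimate I mentioned can be pulled straight out of the planner's own statistics; something like this:

    -- Average tuples per page as the planner sees it.  Assumes ANALYZE
    -- (or VACUUM) has run, so reltuples/relpages are populated and nonzero.
    SELECT relname, reltuples, relpages,
           reltuples / relpages AS est_tuples_per_page
    FROM pg_class
    WHERE relname = 'test1';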