Discussion: Idea for the statistics collector
Since it's currently all for collecting statistics on tables, why can't it
collect another type of statistic, like:

- How often the estimator gets it wrong? At the end of an index scan, the
  executor could compare the number of rows returned against what was
  estimated, and if it falls outside a certain range, flag it.
- The average ratio of rows coming out of a Distinct node vs. the number
  going in.
- For a join clause, the amount of correlation between two columns (hard).
- etc.

Ideally, the planner could then use this info to make better plans.
Eventually, the whole system could become somewhat self-tuning.

Does anyone see any problems with this?

-- 
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> There are 10 kinds of people in the world, those that can do binary
> arithmetic and those that can't.
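The executor-side check described above can be sketched as follows. This is a hypothetical Python illustration of the idea, not PostgreSQL code; the function name and the tolerance constant are invented for the example:

```python
# Sketch of the proposed check: at the end of a scan node, compare the
# planner's row estimate against the actual row count and flag estimates
# that fall outside a tolerance band. All names here are illustrative.

TOLERANCE = 10.0  # flag estimates off by more than a factor of 10

def misestimated(estimated_rows: float, actual_rows: int) -> bool:
    """Return True if the estimate is off by more than TOLERANCE."""
    if actual_rows == 0 or estimated_rows == 0:
        # Avoid division by zero: a zero on one side is a miss unless
        # both sides are (near) zero.
        return max(actual_rows, estimated_rows) > 1
    ratio = max(estimated_rows / actual_rows, actual_rows / estimated_rows)
    return ratio > TOLERANCE

print(misestimated(100, 95))    # close estimate -> False
print(misestimated(100, 5000))  # off by 50x -> True
```

The symmetric ratio treats over- and under-estimates alike, which matches the "falls outside a certain range" framing above.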
Martijn van Oosterhout <kleptog@svana.org> writes:
> Since it's currently all for collecting statistics on tables, why can't it
> collect another type of statistic, like:
> - How often the estimator gets it wrong?
> [snip]
> Does anyone see any problems with this?

(1) forced overhead on *every* query.

(2) contention to update the same rows of pg_statistic (or wherever you
plan to store this info).

(3) okay, so the estimate was wrong; exactly which of the many parameters
that went into the estimate do you plan to twiddle? What if it's not the
parameter values that are at fault, but the cost-model equations
themselves?

Closed-loop feedback is a great thing when you understand the dynamics
of the system you intend to apply feedback control to. When you don't,
it's a great way to shoot yourself in the foot. Unfortunately I don't
think the PG optimizer falls in the first category at present.

			regards, tom lane
On Fri, Jun 21, 2002 at 12:47:18AM -0400, Tom Lane wrote:
> Martijn van Oosterhout <kleptog@svana.org> writes:
> > Since it's currently all for collecting statistics on tables, why can't it
> > collect another type of statistic, like:
> > - How often the estimator gets it wrong?
> > [snip]
> > Does anyone see any problems with this?
>
> (1) forced overhead on *every* query.

If you don't want it, don't use it. The current statistics have the same
issue, and you can disable those as well.

> (2) contention to update the same rows of pg_statistic (or wherever you
> plan to store this info).

True, can't avoid that. It depends on how many queries you run. Maybe only
enable it for specific sessions?

> (3) okay, so the estimate was wrong; exactly which of the many
> parameters that went into the estimate do you plan to twiddle?
> What if it's not the parameter values that are at fault, but the
> cost-model equations themselves?

Firstly, I was only thinking of going for the basic nodes (Index Scan, Seq
Scan, Distinct). Other types have far more variables. Secondly, even if you
only count, it's useful. For example, if it tells you that the planner is
off by a factor of 10 more than 75% of the time, that's useful information
independent of what the actual variables are.

> Closed-loop feedback is a great thing when you understand the dynamics
> of the system you intend to apply feedback control to. When you don't,
> it's a great way to shoot yourself in the foot. Unfortunately I don't
> think the PG optimizer falls in the first category at present.

Using the results for planning is obviously a tricky area and should be
approached with caution. But just collecting the statistics shouldn't be
too bad?

See also -hackers.

-- 
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> There are 10 kinds of people in the world, those that can do binary
> arithmetic and those that can't.
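The counting-only statistic Martijn mentions can be made concrete with a short sketch (Python, purely illustrative; the function and the sample data are invented for the example):

```python
# Even without knowing which parameter was at fault, one can report what
# fraction of estimates were off by more than some factor -- the
# "off by a factor of 10 more than 75% of the time" style of statistic.

def fraction_badly_estimated(samples, factor=10.0):
    """samples: list of (estimated_rows, actual_rows) pairs."""
    def off(est, act):
        if min(est, act) <= 0:
            # a zero on one side counts as a miss if the other side isn't zero
            return max(est, act) > 0
        return max(est / act, act / est) > factor
    misses = sum(1 for est, act in samples if off(est, act))
    return misses / len(samples)

samples = [(100, 90), (10, 2000), (50, 55), (1, 500)]
print(fraction_badly_estimated(samples))  # 2 of 4 pairs are off by >10x -> 0.5
```

A single number like this summarizes planner accuracy without touching pg_statistic at all, which sidesteps Tom's contention objection (2) for the pure-counting case.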
Martijn van Oosterhout wrote:
> Firstly, I was only thinking of going for the basic nodes (Index Scan, Seq
> Scan, Distinct). Other types have far more variables. Secondly, even if you
> only count, it's useful. For example, if it tells you that the planner is
> off by a factor of 10 more than 75% of the time, that's useful information
> independent of what the actual variables are.

Yes, only updating the stats if the estimate was off by a factor of 10 or
so should cut down on the overhead.

-- 
Bruce Momjian                   | http://candle.pha.pa.us
pgman@candle.pha.pa.us          | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup.    | Drexel Hill, Pennsylvania 19026
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Martijn van Oosterhout wrote:
>> Firstly, I was only thinking of going for the basic nodes (Index Scan,
>> Seq Scan, Distinct). Other types have far more variables. Secondly, even
>> if you only count, it's useful. For example, if it tells you that the
>> planner is off by a factor of 10 more than 75% of the time, that's
>> useful information independent of what the actual variables are.

> Yes, only updating the stats if the estimate was off by a factor of 10
> or so should cut down on the overhead.

And reduce the usefulness even more ;-). As a pure stats-gathering
exercise it might be worth doing, but not if you only log the failure
cases. How will you know how well you are doing if you take a
biased-by-design sample?

			regards, tom lane
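Tom's biased-sample objection can be illustrated with a toy simulation (Python; the miss rates and error ratios are invented purely to make the point):

```python
# If only failure cases are logged, two estimators with very different
# accuracy produce indistinguishable logs: the failure-only log records
# *what* the misses look like but not *how often* estimates succeed.
import random

random.seed(42)  # deterministic toy run

def simulate(n_queries, miss_rate, factor=10.0):
    """Return (full_log, failure_log) of error ratios for a toy estimator."""
    full = []
    for _ in range(n_queries):
        # a "miss" is off by 50x, a "hit" is off by at most 1.5x
        ratio = 50.0 if random.random() < miss_rate else 1.5
        full.append(ratio)
    failures = [r for r in full if r > factor]
    return full, failures

full_a, fail_a = simulate(1000, miss_rate=0.05)  # good estimator
full_b, fail_b = simulate(1000, miss_rate=0.50)  # bad estimator
# The failure-only logs contain the same values (all 50.0), yet the
# hit rates differ by an order of magnitude:
print(set(fail_a) == set(fail_b))              # True
print(len(fail_a) / 1000, len(fail_b) / 1000)  # roughly 0.05 vs. 0.50
```

Only the full log (or at least a count of successes alongside the failures) distinguishes the two, which is Jan's point in the next message: count all cases, success and failure.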
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Martijn van Oosterhout wrote:
> >> Firstly, I was only thinking of going for the basic nodes (Index Scan,
> >> Seq Scan, Distinct). Other types have far more variables. Secondly,
> >> even if you only count, it's useful. For example, if it tells you that
> >> the planner is off by a factor of 10 more than 75% of the time, that's
> >> useful information independent of what the actual variables are.
>
> > Yes, only updating the stats if the estimate was off by a factor of 10
> > or so should cut down on the overhead.
>
> And reduce the usefulness even more ;-). As a pure stats-gathering
> exercise it might be worth doing, but not if you only log the failure
> cases. How will you know how well you are doing if you take a
> biased-by-design sample?

Sure, it is required to count all cases, success and failure. But I don't
see why that information has to be fed constantly back into the statistics
tables. As long as we don't restart, it's perfectly good in the collector.
And it must not be fed back to the backend on every query.

Maybe ANALYZE would like to have some of that information? If memory
serves, ANALYZE does a poor job when the data isn't well distributed, has
few distinct values, and the like. That causes wrong estimates then (among
other things, of course). The idea could be to have ANALYZE take a much
closer look at tables with horrible estimates, to generate better
information for those.


Jan

-- 
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
>Tom Lane wrote:
> > Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > > Martijn van Oosterhout wrote:
> > >> Firstly, I was only thinking of going for the basic nodes (Index Scan,
> > >> Seq Scan, Distinct). Other types have far more variables. Secondly,
> > >> even if you only count, it's useful. For example, if it tells you that
> > >> the planner is off by a factor of 10 more than 75% of the time, that's
> > >> useful information independent of what the actual variables are.
> >
> > And reduce the usefulness even more ;-). As a pure stats-gathering
> > exercise it might be worth doing, but not if you only log the failure
> > cases. How will you know how well you are doing if you take a
> > biased-by-design sample?

Personally, given that it seems like at least once or twice a day someone
asks about performance or "why isn't my index being used" and other stuff,
I think doing this would be a great idea.

Perhaps not necessarily in the full-fledged way, but by creating a sort of
"ANALYZE log," wherein it logs the optimizer's estimate of a query and the
actual results of the query, for every query. This, of course, could be
enabled/disabled on a per-connection basis, per-table basis (like OIDs), or
whatever other basis makes life easiest for the developers.

Then, when the next ANALYZE is run, it could do its usual analysis and
apply some additional heuristics based upon what it learns from the
"ANALYZE log," possibly to do several things:

1) Automatically increase/decrease the SET STATISTICS information included
in the analyze, for example, increasing it as a table grows larger and the
"randomness" grows less than linearly with size (e.g., if you have 50 or 60
groups in a 1,000,000-row table, that certainly needs a higher SET
STATISTICS, and I do it on my tables).
2) Have an additional value on the statistics table called the
"index_heuristic" or "random_page_adjustment_heuristic" which, when 1, does
nothing, but otherwise modifies the cost of using an index/seq scan by that
factor - and don't ever change this by more than a few percent each
ANALYZE.

3) Flag in a second log (maybe the regular log) really bad query estimates
- let it do an analysis of the queries and flag anything two or three
standard deviations outside.

Now, I suggest all this stuff in the name of usability and
self-maintainability. Unfortunately, I don't have the wherewithal to
actually assist in development.

Another possibility is to default "use_seq_scan" to OFF, or whatever the
parameter is (I did my optimizing a while ago so it's fading), so that if
there's an index, it will be used, regardless - as this seems to be what
the great majority of people expect to happen. And/or add this to a FAQ,
and let us all reply "see http://.../indexfaq.html." :)

Cheers,

Doug
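Doug's idea (2) amounts to a damped feedback loop, which can be sketched in a few lines (Python; the function, the step cap, and the sample ratios are hypothetical, not anything PostgreSQL implements):

```python
# Nudge a per-table cost adjustment factor toward the observed estimate
# error, capped at a few percent per ANALYZE run, so the planner can
# never swing wildly on one bad batch of queries.

MAX_STEP = 0.03  # never move more than 3% per ANALYZE run

def adjust_heuristic(current: float, observed_ratio: float) -> float:
    """observed_ratio: avg actual_rows / estimated_rows since last ANALYZE."""
    target = current * observed_ratio
    low, high = current * (1 - MAX_STEP), current * (1 + MAX_STEP)
    return min(max(target, low), high)  # clamp to the allowed band

h = 1.0
for ratio in [5.0, 5.0, 0.2]:  # two big underestimates, one overestimate
    h = adjust_heuristic(h, ratio)
print(round(h, 4))  # has moved only a few percent despite 5x errors
```

The clamping is what addresses Tom's stability worry: even if the feedback signal is noisy or the cost model itself is at fault, a capped step size bounds how far the planner can drift between ANALYZE runs.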
Added to TODO list:

	* Log queries where the optimizer row estimates were dramatically
	  different from the number of rows actually found (?)

---------------------------------------------------------------------------

Doug Fields wrote:
> [snip]
>
> Personally, given that it seems like at least once or twice a day someone
> asks about performance or "why isn't my index being used" and other
> stuff, I think doing this would be a great idea.
>
> Perhaps not necessarily in the full-fledged way, but by creating a sort
> of "ANALYZE log," wherein it logs the optimizer's estimate of a query and
> the actual results of the query, for every query. This, of course, could
> be enabled/disabled on a per-connection basis, per-table basis (like
> OIDs), or whatever other basis makes life easiest for the developers.
>
> Then, when the next ANALYZE is run, it could do its usual analysis and
> apply some additional heuristics based upon what it learns from the
> "ANALYZE log," possibly to do several things:
>
> 1) Automatically increase/decrease the SET STATISTICS information
> included in the analyze, for example, increasing it as a table grows
> larger and the "randomness" grows less than linearly with size (e.g., if
> you have 50 or 60 groups in a 1,000,000-row table, that certainly needs
> a higher SET STATISTICS, and I do it on my tables).
>
> 2) Have an additional value on the statistics table called the
> "index_heuristic" or "random_page_adjustment_heuristic" which, when 1,
> does nothing, but otherwise modifies the cost of using an index/seq scan
> by that factor - and don't ever change this by more than a few percent
> each ANALYZE.
>
> 3) Flag in a second log (maybe the regular log) really bad query
> estimates - let it do an analysis of the queries and flag anything two
> or three standard deviations outside.
>
> Now, I suggest all this stuff in the name of usability and
> self-maintainability. Unfortunately, I don't have the wherewithal to
> actually assist in development.
>
> Another possibility is to default "use_seq_scan" to OFF, or whatever the
> parameter is (I did my optimizing a while ago so it's fading), so that
> if there's an index, it will be used, regardless - as this seems to be
> what the great majority of people expect to happen. And/or add this to a
> FAQ, and let us all reply "see http://.../indexfaq.html." :)
>
> Cheers,
>
> Doug

-- 
Bruce Momjian                   | http://candle.pha.pa.us
pgman@candle.pha.pa.us          | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup.    | Newtown Square, Pennsylvania 19073