Re: estimating # of distinct values
От | Tomas Vondra |
---|---|
Тема | Re: estimating # of distinct values |
Дата | |
Msg-id | 4D192963.8080305@fuzzy.cz обсуждение исходный текст |
Ответ на | Re: estimating # of distinct values (Robert Haas <robertmhaas@gmail.com>) |
Список | pgsql-hackers |
Dne 27.12.2010 22:46, Robert Haas napsal(a): > 2010/12/27 Tomas Vondra <tv@fuzzy.cz>: >> But even though these disadvantages, there really is no other >> way to enhance the estimates. I don't think this should be a >> default behavior - just as in case of cross-column stats this should >> be optional when the current estimator does not work well. > > This is going to be a lot of work to implement, so before you do it, > we should try to reach a consensus that (a) it's part of an overall > strategy that the community generally supports and (b) we have > consensus on the design for this part. Yes, it is going to be a lot of work. But I don't think we need a consensus of the whole community prior to building something that works. I plan to build a very simple prototype soon, so let's talk about this later. I've started these two threads mainly as a 'brainstorming' - to present what I've learned from the various papers, propose a possible solution, highlight issues I see, and get ideas on how to solve them. > With respect to (a), I have to admit I've found the discussion on > cross-column stats to be quite difficult to follow. I'd like to see a > very simple description of exactly what information we're going to > store, under what circumstances we'll store it, and how we'll use it > to compute selectivity estimates. Yes, I know it was difficult to follow that discussion. That's why I created a 'summary page' on wiki http://wiki.postgresql.org/wiki/Cross_Columns_Stats Generally we need to gather two types of stats: (a) multi-dimensional histogram - this can be done about the same way we create single-dimensional histograms, that'snot a big problem (although more data might be necessary to get accurate histograms) (b) # of distinct values - this is the tricky part, as described in this problem > With respect to (b), I think I'd need to see a much more detailed > design for how you intend to make this work. Off the top of my head > there seems to be some pretty serious feasibility problems. Well, it's not going to be easy, that's for sure. My "big plan" is something like this: (a) Build a simple 'contrib-like' module that allows manual collection of stats and computing of estimates (not integratedwith the optimizer in any way). This should help us better understand what stats do we need etc. (b) Minimal integration with the core - still manual collection fo stats, the optimizer uses the stats if available. (c) Enhancements - automatic updates of the stats, etc. And all this should be implemented so that you don't have to pay unless you want to use the new stats. regards Tomas
В списке pgsql-hackers по дате отправления: