Re: estimating # of distinct values

Поиск
Список
Период
Сортировка
От Tomas Vondra
Тема Re: estimating # of distinct values
Дата
Msg-id 4D192963.8080305@fuzzy.cz
обсуждение исходный текст
Ответ на Re: estimating # of distinct values  (Robert Haas <robertmhaas@gmail.com>)
Список pgsql-hackers
Dne 27.12.2010 22:46, Robert Haas napsal(a):
> 2010/12/27 Tomas Vondra <tv@fuzzy.cz>:
>>   But even though these disadvantages, there really is no other
>>   way to enhance the estimates. I don't think this should be a
>>   default behavior - just as in case of cross-column stats this should
>>   be optional when the current estimator does not work well.
> 
> This is going to be a lot of work to implement, so before you do it,
> we should try to reach a consensus that (a) it's part of an overall
> strategy that the community generally supports and (b) we have
> consensus on the design for this part.

Yes, it is going to be a lot of work. But I don't think we need a
consensus of the whole community prior to building something that works.
I plan to build a very simple prototype soon, so let's talk about this
later.

I've started these two threads mainly as a 'brainstorming' - to present
what I've learned from the various papers, propose a possible solution,
highlight issues I see, and get ideas on how to solve them.

> With respect to (a), I have to admit I've found the discussion on
> cross-column stats to be quite difficult to follow.  I'd like to see a
> very simple description of exactly what information we're going to
> store, under what circumstances we'll store it, and how we'll use it
> to compute selectivity estimates.

Yes, I know it was difficult to follow that discussion. That's why I
created a 'summary page' on wiki
  http://wiki.postgresql.org/wiki/Cross_Columns_Stats

Generally we need to gather two types of stats:
 (a) multi-dimensional histogram - this can be done about the same way     we create single-dimensional histograms,
that'snot a big problem     (although more data might be necessary to get accurate histograms)
 
 (b) # of distinct values - this is the tricky part, as described in     this problem

> With respect to (b), I think I'd need to see a much more detailed
> design for how you intend to make this work.  Off the top of my head
> there seems to be some pretty serious feasibility problems.

Well, it's not going to be easy, that's for sure. My "big plan" is
something like this:

(a) Build a simple 'contrib-like' module that allows manual collection   of stats and computing of estimates (not
integratedwith the   optimizer in any way).
 
   This should help us better understand what stats do we need etc.

(b) Minimal integration with the core - still manual collection fo   stats, the optimizer uses the stats if available.

(c) Enhancements - automatic updates of the stats, etc.

And all this should be implemented so that you don't have to pay unless
you want to use the new stats.

regards
Tomas


В списке pgsql-hackers по дате отправления:

Предыдущее
От: "Kevin Grittner"
Дата:
Сообщение: Re: estimating # of distinct values
Следующее
От: Peter Geoghegan
Дата:
Сообщение: Re: "writable CTEs"