Re: Cross-column statistics revisited

From Martijn van Oosterhout
Subject Re: Cross-column statistics revisited
Msg-id 20081017062421.GA1443@svana.org
In reply to Re: Cross-column statistics revisited  ("Joshua Tolley" <eggyknap@gmail.com>)
Responses Re: Cross-column statistics revisited  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On Thu, Oct 16, 2008 at 09:17:03PM -0600, Joshua Tolley wrote:
> Because I'm trying to picture geometrically how this might work for
> the two-column case, and hoping to extend that to more dimensions, and
> am finding that picturing a quantile-based system like the one we have
> now in multiple dimensions is difficult.

Just a note: using multidimensional histograms will work well for
cases like (startdate,enddate), where the histogram will show a
clustering of values along the diagonal. But it will fail for the case
(zipcode,state), where one implies the other. Histogram-wise you're not
going to see any correlation at all, but what you want to know is:

count(distinct zipcode,state) = count(distinct zipcode)

So you might need to think about storing/searching for different kinds
of correlation.
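
To make the distinct-count test above concrete, here is a minimal sketch in
Python rather than SQL. The sample rows and the helper name `implies` are
made up for illustration; this only shows the comparison itself, not
anything the planner does today.

```python
def implies(rows, a, b):
    """Return True if column a determines column b in the given rows.

    This is the check: count(distinct a, b) == count(distinct a).
    """
    distinct_a = {row[a] for row in rows}
    distinct_ab = {(row[a], row[b]) for row in rows}
    return len(distinct_ab) == len(distinct_a)


# Hypothetical sample: every zipcode belongs to exactly one state.
rows = [
    {"zipcode": "10001", "state": "NY"},
    {"zipcode": "10002", "state": "NY"},
    {"zipcode": "94105", "state": "CA"},
    {"zipcode": "94105", "state": "CA"},
    {"zipcode": "73301", "state": "TX"},
]

print(implies(rows, "zipcode", "state"))  # zipcode determines state
print(implies(rows, "state", "zipcode"))  # state does not determine zipcode
```

The test is asymmetric: it holds for zipcode to state but not the other way
round, which is exactly the kind of relationship a histogram cannot show.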

Secondly, my feeling about multidimensional histograms is that you're
not going to need the matrix to have 100 bins along each axis, but that
it'll be enough to have 1000 bins total. The cases where we get it
wrong enough for people to notice will probably be the same cases where
the histogram will have noticeable variation even for a small number of
bins.
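
As a rough illustration of the bin-budget point, the sketch below builds a
32 x 32 grid (1024 cells, close to "1000 bins total") over made-up
(startdate,enddate) pairs. Everything here is an assumption for the sake of
the example: the synthetic data, the name `histogram_2d`, and the use of
equal-width cells, which is simpler than the quantile-based histograms
discussed in this thread.

```python
import random
from collections import Counter

BINS_PER_AXIS = 32  # 32 * 32 = 1024 cells, roughly a 1000-bin budget


def histogram_2d(pairs, lo, hi, bins=BINS_PER_AXIS):
    """Count (x, y) pairs per cell of an equal-width bins x bins grid."""
    width = (hi - lo) / bins
    cells = Counter()
    for x, y in pairs:
        # Clamp so a value exactly at the upper bound lands in the last bin.
        i = min(int((x - lo) / width), bins - 1)
        j = min(int((y - lo) / width), bins - 1)
        cells[(i, j)] += 1
    return cells


# Synthetic data: end is always shortly after start, so the pairs
# cluster along the diagonal.
rng = random.Random(0)
pairs = []
for _ in range(10_000):
    start = rng.uniform(0, 1000)
    end = start + rng.uniform(0, 10)
    pairs.append((start, end))

cells = histogram_2d(pairs, 0, 1010)
near_diagonal = sum(n for (i, j), n in cells.items() if abs(i - j) <= 1)

print(len(cells), "of", BINS_PER_AXIS ** 2, "cells are non-empty")
print(near_diagonal / len(pairs), "of rows lie on or next to the diagonal")
```

Even at this coarse resolution the diagonal clustering is obvious, since
only a thin band of cells is populated at all.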

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.
