Re: Multi-Dimensional Histograms

Поиск

Список

Период

Сортировка

От	Nathan Boley
Тема	Re: Multi-Dimensional Histograms
Дата	29 июня 2009 г. 20:28:04
Msg-id	6fa3b6e20906291328jc6f33b9me6f911845ea70bfa@mail.gmail.com обсуждение исходный текст
Ответ на	Multi-Dimensional Histograms (David Fetter <david@fetter.org>)
Ответы	Re: Multi-Dimensional Histograms
Список	pgsql-hackers

Дерево обсуждения

> For things like PostGIS, which will want to index in 4 dimensions
> (x, y, z, t), we might want to have multi-dimensional selectivity
> histograms and some way to use same.
>

Another use case is cross column statistics.

> Anybody here qualified to check out this paper on the subject, please
> speak up :)
>

Well, I don't know if I'm qualified per se, but here goes...

It seems seems like a very reasonable approach to multidimensional histograms.

For those who haven't had time to read the paper, it describes SGRID,
a greedy algorithm for partitioning an n-dimensional space into
non-overlapping n-dimensional boxes and 'outliers' ( points that dont
fit into any boxes, but dont warrant their own boxes ). Then, they
compare their method to a regression technique,  where the space is
fit into a fixed grid histogram and then the grid densities are
estimated via regression. They also compare a plane splitting
algorithm, called MHIST, where the planes are constrained to be
orthogonal to the naive dimension vectors. They dismiss singular value
decomposition and the discrete wavelet transform as being too
parametric ( which is silly, IMHO ) and briefly mention a couple
standard clustering techniques ( which are probably not appropriate
for us, given their complexity ). Unsurprisingly, it does well on the
two test sets that they consider.

It think the general idea is fine, but it would certainly need some
modification to work for postgres.

In particular,

Using the naive dimension vectors ( ie, north and east for geo data,
or column a and column b for cross-column stats ) in combination with
the box constraint will probably lead to problems. Consider, for
example, a river that goes in a straight line north-east, and a table
that stores camp sites which are mostly along the river. Because SGRID
can only create boxes with north east edges, you would end up with a
bunch of tiny box on the river where one, long north-east pointing box
would do.

We would probably want to replace the error term that SGRID uses to
determine when to create a new box by a statistical test ( maybe
homogeneity of means? ) ( also, the same holds for the outliers ).

Finally, this creates the partition but ( AFAICT ) it doesn't describe
a method for locating the histogram estimate given a point ( although
that doesn't seem too difficult ).

-Nathan

В списке pgsql-hackers по дате отправления:

Предыдущее

От: Josh Berkus
Дата: 29 июня 2009 г., 20:03:07
Сообщение: Re: pre-proposal: permissions made easier

Следующее

От: David Fetter
Дата: 29 июня 2009 г., 20:43:40
Сообщение: Re: Multi-Dimensional Histograms

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: Multi-Dimensional Histograms

Предыдущее

Следующее