Re: GiST penalty functions [PoC]

From: Andrew Borodin
Subject: Re: GiST penalty functions [PoC]
Msg-id: CAJEAwVG7NXPYJsLo7rpNi9dG9NiBJfOuK-xF_K2PRZmKg7-Hkg@mail.gmail.com
In reply to: Re: GiST penalty functions [PoC]  (Heikki Linnakangas <hlinnaka@iki.fi>)
Responses: Re: GiST penalty functions [PoC]  (Michael Paquier <michael.paquier@gmail.com>)
List: pgsql-hackers
Hi Heikki!

> Got a link for a description of the RR*-tree? Would be good to reference it
> in the patch comments, too.
For now, the best way to reach the paper is via its DOI,
10.1145/1559845.1559929, through
http://sci-hub.cc/
In conversation the authors clearly stated that they endorse (not sure
that is the right word) an implementation in PostgreSQL, so I do not
think it is a bad kind of piracy.
A more or less persistent link is http://dl.acm.org/citation.cfm?id=1559929

> If I understand correctly, cases #1 and #3 arise when one of the dimensions
> is 0. For example, in a 3D space, if the existing entry is a rectangle on a
> plane, with zero-length edge on one of the dimensions, and the new entry is
> on the same plane. Case #1 arises if the new entry falls within that
> rectangle, and case #3 if it's outside it. Currently, we treat all such
> cases as 0-penalty, because the degenerate 0-dimension causes all the
> calculated volumes to become zero. So clearly we can do better, which is
> what this patch does.
>
> At first blush, I'm surprised that you switch to using the sum of the edges
> in those cases. I would expect to ignore the degenerate 0-dimension, and
> calculate the volume using the other dimensions. So in a 3D space, you would
> calculate the increase of the area of the rectangle (A*B), not the sum of
> the edges (A+B). And it probably would be best to take into account how many
> of the dimensions are zero. So in a 3D space, if there is an existing line
> segment that the new point falls into, and also a rectangle that the new
> point falls into, you should prefer the 1-dimensional line segment over the
> 2-dimensional rectangle.
>
> I don't know how big a difference that makes in practice. But it seems odd
> that if you e.g. have a set of 3D points, but the Z dimension in all of the
> points is 0, the algorithm behaves differently than if you had the exact
> same points in a 2D space.
>
> (If this is explained in the RR*-tree paper, feel free to just point me to
> that and I'll ask again if I have further questions.)

As far as I know, your version of penalty function degradation is a
new concept for R-trees. I have not seen this idea before.
It rests on two assertions:
1. Degraded algorithms should resemble the general algorithm, if the
choice of the general algorithm is correct.
2. Degradation happens if and only if at least one edge of the MBB has
zero size.

The first assertion seems correct, while the second is wrong. When you
index high-dimensional data (say, 10 dimensions), you can easily reach 0
by multiplying values around 10^-4. Such data is often the result of
scaling and normalizing in machine learning, where these concepts are
natural, along with high dimensionality.
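To put numbers on the underflow argument, here is a minimal sketch in C.
The helper names uniform_volume and uniform_volume_f are hypothetical and
are not part of cube or GiST; they just multiply one edge length by itself
once per dimension. With exactly 10 edges of 1e-4 the product (1e-40) is
still nonzero in a double, but it collapses to zero once the result is
kept in a float with a couple more dimensions, and in a double somewhere
around 80 dimensions.

```c
#include <assert.h>

/* Hypothetical helper: volume of a box with `dims` equal edges (double). */
double uniform_volume(double edge, int dims)
{
    double v = 1.0;

    assert(dims >= 0);
    for (int i = 0; i < dims; i++)
        v *= edge;              /* may underflow to 0.0 for small edges */
    return v;
}

/* Same computation, but with every intermediate stored as float. */
float uniform_volume_f(float edge, int dims)
{
    float v = 1.0f;

    assert(dims >= 0);
    for (int i = 0; i < dims; i++)
        v *= edge;
    return v;
}
```

For example, uniform_volume(1e-4, 10) is still positive, while
uniform_volume_f(1e-4f, 12) and uniform_volume(1e-4, 100) both come out
as exactly zero, even though no edge has zero size.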

We can rework your algorithm: edges are sorted in descending order and
multiplied until just before the product reaches zero. Choose subtree
then picks a tuple by the count of edges multiplied, resolving ties by
the resulting product.
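The reworked scheme can be sketched roughly as follows. This is only an
illustration under assumptions: the DegradedVolume type and both function
names are made up here, edges are assumed to be non-negative and finite,
and the comparison direction (fewer multiplied edges wins, then the
smaller product) is a guess based on the preference for lower-dimensional
boxes described in the quoted text. None of this is existing GiST or cube
code.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical result type: how many edges were multiplied, and the product. */
typedef struct
{
    int     count;
    double  product;
} DegradedVolume;

/* qsort comparator that orders doubles in descending order. */
static int cmp_desc(const void *a, const void *b)
{
    double x = *(const double *) a;
    double y = *(const double *) b;

    return (x < y) - (x > y);
}

/*
 * Sort edges in descending order (the input array is reordered in place)
 * and multiply them, stopping just before the product would become zero.
 */
DegradedVolume degraded_volume(double *edges, int n)
{
    DegradedVolume r = {0, 1.0};

    assert(n >= 0);
    qsort(edges, (size_t) n, sizeof(double), cmp_desc);
    for (int i = 0; i < n; i++)
    {
        double next = r.product * edges[i];

        if (next == 0.0)        /* zero edge or underflow: stop here */
            break;
        r.product = next;
        r.count++;
    }
    return r;
}

/*
 * Assumed ordering: compare by count first, then by product.
 * Returns <0, 0 or >0, like strcmp.
 */
int degraded_volume_cmp(DegradedVolume a, DegradedVolume b)
{
    if (a.count != b.count)
        return (a.count > b.count) - (a.count < b.count);
    return (a.product > b.product) - (a.product < b.product);
}
```

With edges {0.0, 2.0, 3.0} this yields count 2 and product 6.0; with
{1e-200, 1e-200, 1.0} it also yields count 2, because the third
multiplication would underflow, and the product stays at 1e-200.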

Large edges could burn us through overflow, but cube currently allows
no more than 100 dimensions. That leaves a little more than 1000 per
edge before overflow (if the multiplication is done in doubles).
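The overflow bound can be checked with a small sketch. The helper name
volume_overflows is hypothetical; it multiplies `dims` equal edges in
double precision and reports whether the result exceeds DBL_MAX. Since
DBL_MAX is roughly 1.8e308, 100 edges of 1000 give 1e300 and still fit,
while 100 edges of 1300 do not.

```c
#include <assert.h>
#include <float.h>

/*
 * Hypothetical helper: returns 1 if the product of `dims` equal edges
 * overflows a double (becomes infinity), 0 otherwise.
 */
int volume_overflows(double edge, int dims)
{
    double v = 1.0;

    assert(dims >= 0);
    for (int i = 0; i < dims; i++)
        v *= edge;
    return v > DBL_MAX;         /* only +infinity compares greater */
}
```

Here volume_overflows(1000.0, 100) returns 0 and
volume_overflows(1300.0, 100) returns 1.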

In practice, this algorithm cannot be implemented with the current GiST
API (only if we provide a non-penalty-based choose subtree function,
optional for GiST extensions), but it certainly has scientific value.

Regards, Andrey Borodin.


