Re: GSoC 2014 proposal

Поиск
Список
Период
Сортировка
От Alexander Korotkov
Тема Re: GSoC 2014 proposal
Дата
Msg-id CAPpHfdteAfUo5YG1Q46Hq3nm9LDkFJcE3feOKp=rJwqPUrbvUA@mail.gmail.com
обсуждение исходный текст
Ответ на Re: GSoC 2014 proposal  (Heikki Linnakangas <hlinnakangas@vmware.com>)
Список pgsql-hackers
On Thu, Apr 3, 2014 at 11:21 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
On 04/03/2014 04:15 PM, Alexander Korotkov wrote:
On Wed, Apr 2, 2014 at 2:22 PM, Alexander Korotkov <aekorotkov@gmail.com>wrote:

On Tue, Apr 1, 2014 at 2:23 PM, Heikki Linnakangas <
hlinnakangas@vmware.com> wrote:

The BIRCH algorithm as described in the paper describes building a tree
in memory. If I understood correctly, you're suggesting to use a pre-built
GiST index instead. Interesting idea!

There are a couple of signifcant differences between the CF tree
described in the paper and GiST:

1. In GiST, a leaf item always represents one heap tuple. In the CF tree,
a leaf item represents a cluster, which consists of one or more tuples. So
the CF tree doesn't store an entry for every input tuple, which makes it
possible to keep it in memory.

2. In the CF tree, "all entries in a leaf node must satisfy a threshold
requirement, with respect to a threshold value T: the diameter (or radius)
has to be less than T". GiST imposes no such restrictions. An item can
legally be placed anywhere in the tree; placing it badly will just lead to
degraded search performance, but it's still a legal GiST tree.

3. A GiST index, like any other index in PostgreSQL, holds entries also
for deleted tuples, until the index is vacuumed. So you cannot just use
information from a non-leaf node and use it in the result, as the
information summarized at a non-leaf level includes noise from the dead
tuples.

Can you elaborate how you are planning to use a GiST index to implement
BIRCH? You might also want to take a look at SP-GiST; SP-GiST is more
strict in where in the tree an item can be stored, and lets the operator
class to specify exactly when a node is split etc.


Hmmm, it's likely I've imagined something quite outside of this paper, and
even already suggested it to Ivan... :)
I need a little time to rethink it.


Using GiST we can implement BIRCH-like clustering like so:
1) Build a CF tree as GiST index without restriction of T threshold value.
2) Scan CF tree with threshold T with some auxiliary operator. If
consistent method see CF entry which diameter is greater than T then it
returns true. Otherwise it returns false and put this CF entry into output
area (could be either in-memory or temporary table).
3) Process other steps of algorithm as usual.

I still don't understand how that would work. You can't trust the non-leaf entries, because their CF value can contain deleted entries. So you have to scan every tuple anyway. Once you do that, what's the point of the index?

Yeah, it is limitation of this idea. It's not going to be auto-updatable CF. User can build index on freshly vacuumed table and play with clustering some time. Updates on table during that time would be either allowed clustering error or prohibited. Another potential solution is to let this index to hold some snapshot of data. But it seems not possible to do in extension now.

------
With best regards,
Alexander Korotkov.   

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Heikki Linnakangas
Дата:
Сообщение: Re: GSoC 2014 proposal
Следующее
От: Tom Lane
Дата:
Сообщение: Re: [COMMITTERS] pgsql: Add ALTER TABLESPACE ... MOVE command