Re: GSoC 2014 proposal

Поиск
Список
Период
Сортировка
От Alexander Korotkov
Тема Re: GSoC 2014 proposal
Дата
Msg-id CAPpHfdt6B-m7hz6LexAx9EriWKAWQmK=ZDHzEw3mxrL3nxUU4w@mail.gmail.com
обсуждение исходный текст
Ответ на Re: GSoC 2014 proposal  (Alexander Korotkov <aekorotkov@gmail.com>)
Ответы Re: GSoC 2014 proposal  (Heikki Linnakangas <hlinnakangas@vmware.com>)
Список pgsql-hackers
On Wed, Apr 2, 2014 at 2:22 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
On Tue, Apr 1, 2014 at 2:23 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
The BIRCH algorithm as described in the paper describes building a tree in memory. If I understood correctly, you're suggesting to use a pre-built GiST index instead. Interesting idea!

There are a couple of signifcant differences between the CF tree described in the paper and GiST:

1. In GiST, a leaf item always represents one heap tuple. In the CF tree, a leaf item represents a cluster, which consists of one or more tuples. So the CF tree doesn't store an entry for every input tuple, which makes it possible to keep it in memory.

2. In the CF tree, "all entries in a leaf node must satisfy a threshold requirement, with respect to a threshold value T: the diameter (or radius) has to be less than T". GiST imposes no such restrictions. An item can legally be placed anywhere in the tree; placing it badly will just lead to degraded search performance, but it's still a legal GiST tree.

3. A GiST index, like any other index in PostgreSQL, holds entries also for deleted tuples, until the index is vacuumed. So you cannot just use information from a non-leaf node and use it in the result, as the information summarized at a non-leaf level includes noise from the dead tuples.

Can you elaborate how you are planning to use a GiST index to implement BIRCH? You might also want to take a look at SP-GiST; SP-GiST is more strict in where in the tree an item can be stored, and lets the operator class to specify exactly when a node is split etc.

Hmmm, it's likely I've imagined something quite outside of this paper, and even already suggested it to Ivan... :)
I need a little time to rethink it.

Using GiST we can implement BIRCH-like clustering like so:
1) Build a CF tree as GiST index without restriction of T threshold value.
2) Scan CF tree with threshold T with some auxiliary operator. If consistent method see CF entry which diameter is greater than T then it returns true. Otherwise it returns false and put this CF entry into output area (could be either in-memory or temporary table).
3) Process other steps of algorithm as usual.

This modification would have following advantages:
1) User can build GiST index once and then try clustering with different parameters. Initial GiST index build would be slowest operation while other steps is expected to be fast.
2) Use GiST infrastructure and automatically get buffering build.

The drawback is that building GiST index is more expensive than building in-memory CF tree with given threshold T (assuming T is well chosen).

Does it make any sense?

------
With best regards,
Alexander Korotkov.  

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Andres Freund
Дата:
Сообщение: Re: GSoC proposal - "make an unlogged table logged"
Следующее
От: Tom Lane
Дата:
Сообщение: Re: quiet inline configure check misses a step for clang