Re: GSoC 2014 proposal

Поиск

Список

Период

Сортировка

От	Heikki Linnakangas
Тема	Re: GSoC 2014 proposal
Дата	3 апреля 2014 г. 19:22:08
Msg-id	533DB4D5.4010109@vmware.com обсуждение исходный текст
Ответ на	Re: GSoC 2014 proposal (Alexander Korotkov <aekorotkov@gmail.com>)
Ответы	Re: GSoC 2014 proposal
Список	pgsql-hackers

Дерево обсуждения

On 04/03/2014 04:15 PM, Alexander Korotkov wrote:
> On Wed, Apr 2, 2014 at 2:22 PM, Alexander Korotkov <aekorotkov@gmail.com>wrote:
>
>> On Tue, Apr 1, 2014 at 2:23 PM, Heikki Linnakangas <
>> hlinnakangas@vmware.com> wrote:
>>
>>> The BIRCH algorithm as described in the paper describes building a tree
>>> in memory. If I understood correctly, you're suggesting to use a pre-built
>>> GiST index instead. Interesting idea!
>>>
>>> There are a couple of signifcant differences between the CF tree
>>> described in the paper and GiST:
>>>
>>> 1. In GiST, a leaf item always represents one heap tuple. In the CF tree,
>>> a leaf item represents a cluster, which consists of one or more tuples. So
>>> the CF tree doesn't store an entry for every input tuple, which makes it
>>> possible to keep it in memory.
>>>
>>> 2. In the CF tree, "all entries in a leaf node must satisfy a threshold
>>> requirement, with respect to a threshold value T: the diameter (or radius)
>>> has to be less than T". GiST imposes no such restrictions. An item can
>>> legally be placed anywhere in the tree; placing it badly will just lead to
>>> degraded search performance, but it's still a legal GiST tree.
>>>
>>> 3. A GiST index, like any other index in PostgreSQL, holds entries also
>>> for deleted tuples, until the index is vacuumed. So you cannot just use
>>> information from a non-leaf node and use it in the result, as the
>>> information summarized at a non-leaf level includes noise from the dead
>>> tuples.
>>>
>>> Can you elaborate how you are planning to use a GiST index to implement
>>> BIRCH? You might also want to take a look at SP-GiST; SP-GiST is more
>>> strict in where in the tree an item can be stored, and lets the operator
>>> class to specify exactly when a node is split etc.
>>>
>>
>> Hmmm, it's likely I've imagined something quite outside of this paper, and
>> even already suggested it to Ivan... :)
>> I need a little time to rethink it.
>>
>
> Using GiST we can implement BIRCH-like clustering like so:
> 1) Build a CF tree as GiST index without restriction of T threshold value.
> 2) Scan CF tree with threshold T with some auxiliary operator. If
> consistent method see CF entry which diameter is greater than T then it
> returns true. Otherwise it returns false and put this CF entry into output
> area (could be either in-memory or temporary table).
> 3) Process other steps of algorithm as usual.

I still don't understand how that would work. You can't trust the 
non-leaf entries, because their CF value can contain deleted entries. So 
you have to scan every tuple anyway. Once you do that, what's the point 
of the index?

- Heikki

В списке pgsql-hackers по дате отправления:

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: GSoC 2014 proposal