GSoC 2014 proposal

Поиск

Список

Период

Сортировка

От	Иван Парфилов
Тема	GSoC 2014 proposal
Дата	30 марта 2014 г. 20:50:29
Msg-id	CAMRhwe7tMzJtxy4q5UTfdethemptTR8GsofCufkDJ574qqO2tA@mail.gmail.com обсуждение исходный текст
Ответы	Re: GSoC 2014 proposal Re: GSoC 2014 proposal
Список	pgsql-hackers

Дерево обсуждения

Hello, hackers! This is my GSoC proposal.

Short description:

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics. The purpose of this project is to add support of BIRCH(balanced iterative reducing and clustering using hierarchies) algorithm [1] for data type cube.

Benefits to the PostgreSQL Community

Support of BIRCH algorithm for data type cube would be actual for many PostgreSQL applications (for example, to solve data clustering problem for high dimensional datasets and for large datasets).

Quantifiable results

Adding support of BIRCH algorithm for data type cube

Project Details
BIRCH (balanced iterative reducing and clustering using hierarchies) is an unsupervised data mining algorithm used to perform hierarchical clustering over particularly large data-sets.

The BIRCH algorithm (Balanced Iterative Reducing and Clustering Hierarchies) of Zhang [1] was developed to handle massive datasets that are too large to be contained in the main memory (RAM). To minimize I/O costs, every datum is read once and only once. BIRCH transforms the data set into compact, locally similar subclusters, each with summary statistics attached (called clustering features). Then, instead of using the full data set, these summary statistics can be used. This approach is most advantageous in two situations: when the data cannot be loaded into memory due to its size; and/or when some form of combinatorial optimization is required and the size of the solution space makes ﬁnding global maximums/minimums difﬁcult.

Key properties of BIRCH algorithm:

single scan of the dataset enough;

I/O cost minimization: Organize data in a in-memory, height-balanced tree;

Each clustering decision is made without scanning all the points or clusters.

The implementation of this algorithm would be for data type cube and based on GiST.

The key concept of BIRCH algorithm is clustering feature. Given a set of N d-dimensional data points, the clustering feature CF of the set is defined as the triple CF = (N,LS,SS), where LS is the linear sum and SS is the square sum of data points. Clustering features are organized in a CF tree, which is a height balanced tree with two parameters: branching factor B and threshold T.

Because the structure of CF tree is similar to B+-tree we can use GiST for implementation [2].
The GiST is a balanced tree structure like a B-tree, containing <key, pointer> pairs. GiST key is a member of a user-defined class, and represents some property that is true of all data items reachable from the pointer associated with the key. The GiST provides a possibility to create custom data types with indexed access methods and extensible set of queries.

There are seven methods that an index operator class for GiST must provide, and an eighth that is optional:

-consistent

-union

-compress

-decompress

-penalty

-picksplit

-equal

-distance (optional).

We need to implement it to create GiST-based CF-tree to use it in BIRCH algorithm.

Example of usage(approximate):

create table cube_test (v cube);

insert into cube_test values (cube(array[1.2, 0.4]), cube(array[0.5, -0.2]),

cube(array[0.6, 1.0]),cube(array[1.0, 0.6]) );

create index gist_cf on cube_test using gist(v);

--Prototype(approximate)

--birch(maxNodeEntries, distThreshold, distFunction)

SELECT birch(4.1, 0.2, 1) FROM cube_test;

cluster | val1 | val2

---------+------+--------

1 | 1.2 | 0.4

0 | 0.5 | -0.2

1 | 0.6 | 1.0

1 | 1.0 | 0.6

Accordingly, in this GSoC project BIRCH algorithm for data type cube would be implemented.

Inch-stones

1) Solve architecture questions with help of community.

2) First, approximate implementation(implement distance methods, implement GiST interface methods, implement BIRCH algorithm for data type cube).

3) Approximate implementation evaluation.

4) Final refactoring, documentation, testing.

Project Schedule

until May 19

Solve architecture questions with help of community.

20 May - 27 June

First, approximate implementation.

28 June - 11 August

Approximate implementation evaluation. Fixing bugs and performance testing.

August 11 - August 18:

Final refactoring, write tests, improve documentation.

Completeness Criteria

Support of BIRCH algorithm for data type cube is implemented and working.

Links

1) http://www.cs.sfu.ca/CourseCentral/459/han/papers/zhang96.pdf
2) http://www.postgresql.org/docs/9.1/static/gist-implementation.html

----

With best regards, Ivan Parfilov.

В списке pgsql-hackers по дате отправления:

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

GSoC 2014 proposal