GSoC 2014 proposal

Поиск
Список
Период
Сортировка
От Иван Парфилов
Тема GSoC 2014 proposal
Дата
Msg-id CAMRhwe7tMzJtxy4q5UTfdethemptTR8GsofCufkDJ574qqO2tA@mail.gmail.com
обсуждение исходный текст
Ответы Re: GSoC 2014 proposal  (Heikki Linnakangas <hlinnakangas@vmware.com>)
Re: GSoC 2014 proposal  (Heikki Linnakangas <hlinnakangas@vmware.com>)
Список pgsql-hackers
Hello, hackers! This is my GSoC proposal.

Short description: 

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics. The purpose of this project is to add support of BIRCH(balanced iterative reducing and clustering using hierarchies) algorithm [1] for data type cube.

Benefits to the PostgreSQL Community

Support of BIRCH algorithm for data type cube would be actual for many PostgreSQL applications (for example, to solve data clustering problem for high dimensional datasets and for large datasets).

 Quantifiable results

 Adding support of BIRCH algorithm for data type cube

Project Details
BIRCH (balanced iterative reducing and clustering using hierarchies) is an unsupervised data mining algorithm used to perform hierarchical clustering over particularly large data-sets.

The BIRCH algorithm (Balanced Iterative Reducing and Clustering Hierarchies) of Zhang [1] was developed to handle massive datasets that are too large to be contained in the main memory (RAM). To minimize I/O costs, every datum is read once and only once. BIRCH transforms the data set into compact, locally similar subclusters, each with summary statistics attached (called clustering features). Then, instead of using the full data set, these summary statistics can be used. This approach is most advantageous in two situations: when the data cannot be loaded into memory due to its size; and/or when some form of combinatorial optimization is required and the size of the solution space makes finding global maximums/minimums difficult.

Key properties of BIRCH algorithm:

single scan of the dataset enough;

I/O cost minimization: Organize data in a in-memory, height-balanced tree;

Each clustering decision is made without scanning all the points or clusters.

The implementation of this algorithm would be for data type cube and based on GiST.

The key concept of BIRCH algorithm is clustering feature. Given a set of N d-dimensional data points, the clustering feature CF of the set is defined as the triple CF = (N,LS,SS), where LS is the linear sum and SS is the square sum of data points. Clustering features are organized in a CF tree, which is a height balanced tree with two parameters: branching factor B and threshold T.

Because the structure of CF tree is similar to B+-tree we can use GiST for implementation [2].
The GiST is a balanced tree structure like a B-tree, containing <key, pointer> pairs. GiST key is a member of a user-defined class, and represents some property that is true of all data items reachable from the pointer associated with the key. The GiST provides a possibility to create custom data types with indexed access methods and extensible set of queries. 

There are seven methods that an index operator class for GiST must provide, and an eighth that is optional:

-consistent

-union

-compress

-decompress

-penalty

-picksplit

-equal 

-distance (optional).

We need to implement it to create GiST-based CF-tree to use it in BIRCH algorithm.


Example of usage(approximate):

create table cube_test (v cube);

insert into cube_test values (cube(array[1.2, 0.4]), cube(array[0.5, -0.2]),

cube(array[0.6, 1.0]),cube(array[1.0, 0.6]) );

create index gist_cf on cube_test using gist(v);

--Prototype(approximate)

--birch(maxNodeEntries, distThreshold, distFunction)

SELECT birch(4.1, 0.2, 1) FROM cube_test;

 cluster | val1 | val2    

---------+------+--------

      1  |  1.2 |  0.4

      0  |  0.5 | -0.2

      1  |  0.6 |  1.0

      1  |  1.0 |  0.6

Accordingly, in this GSoC project BIRCH algorithm for data type cube would be implemented.


Inch-stones

 1) Solve architecture questions with help of community.

 2) First, approximate implementation(implement distance methods, implement GiST interface methods, implement BIRCH algorithm for data type cube).

3) Approximate implementation evaluation.

4) Final refactoring, documentation, testing.


 Project Schedule

 until May 19

 Solve architecture questions with help of community.

 20 May - 27 June

 First, approximate implementation.

 28 June - 11 August

 Approximate implementation evaluation. Fixing bugs and performance testing.

 August 11 - August 18:

 Final refactoring, write tests, improve documentation.

 Completeness Criteria

 Support of BIRCH algorithm for data type cube is implemented and working.

 Links

1) http://www.cs.sfu.ca/CourseCentral/459/han/papers/zhang96.pdf 
2) http://www.postgresql.org/docs/9.1/static/gist-implementation.html

 ----

With best regards, Ivan Parfilov.


В списке pgsql-hackers по дате отправления:

Предыдущее
От: Christoph Berg
Дата:
Сообщение: Re: Securing "make check" (CVE-2014-0067)
Следующее
От: Florian Pflug
Дата:
Сообщение: Re: [PATCH] Negative Transition Aggregate Functions (WIP)