GSoC application: MADlib k-medoids clustering

Поиск
Список
Период
Сортировка
От Maxence Ahlouche
Тема GSoC application: MADlib k-medoids clustering
Дата
Msg-id CAJeaomV+mxkiYfWwYBOefbpJPe2JUKqysuA=5A7WB0PvPw_uGw@mail.gmail.com
обсуждение исходный текст
Ответы Re: GSoC application: MADlib k-medoids clustering  (Maxence Ahlouche <maxence.ahlouche@gmail.com>)
Список pgsql-hackers
Hi all,

Some of you may remember me from last year: I had applied to implement the k-medoids clustering method in MADlib, a Postgres and GreenPlum library.
As this project could not be selected last year, I'm trying again now!

You can find my application at this address: https://github.com/viodlen/gsoc_2014
I've also pasted it at the end of this email.

As for me: I'm Maxence Ahlouche, a French student in computer science. I've been studying IT for almost 4 years now, the first two of them being in a technical degree. I'm currently in an engineering school, and am supposed to graduate next year.


However, there's a problem with my application: the GSoC is supposed to begin on may 19, but I have lectures and exams until at least June 4, and the official end of my school year is on June 13. The period between June 4 and June 14 is intended for students to finish their school projects, if they haven't already.
In the worst case, I won't have time to do anything significant for the first 4 weeks.

Still, I think I'm able to handle this situation, and this worst case is quite unlikely. Between May 19 and June 4, I'll have lectures and exams, but it should still be a quiet time, and I'll have time to work on my GSoC project the evenings and week-ends.  After June 4, I'll only have my school projects to finish and present, and I think I'll be able to spend most of my time on GSoC.
I've taken it into account in my proposal's schedule, that's why the first part takes almost as long as the second, even though it looks easier to me.

Of course, if the community doesn't want to take this risk, I'll understand.

Regards,
Maxence Ahlouche

My proposal:
GSoC 2014 proposal: implementing clustering algorithms in MADlib
================================================================

Synopsis
--------

This project aims to implement some clustering algorithms in MADlib,
which is a data analytics and machine learning library for PostgreSQL,
Greenplum and HAWQ.

Benefits to the PostgreSQL community
------------------------------------

Currently, only the k-means clustering algorithm is implemented in
MADlib (see the doc:
http://doc.madlib.net/latest/group__grp__clustering.html ). The
k-medoids algorithm, while being computationnally more intensive, is
much less sensitive to outliers (points that don't belong obviously to
one cluster or another). This is interesting on noisy datasets, that's
why I'm planning to implement it during the first part of the GSoC.

Still, these algorithms are based on distance computation, therefore
they can only find convex clusters. That's why I'm proposing to
implement the OPTICS (*ordering points to identify the clustering
structure*, see http://en.wikipedia.org/wiki/OPTICS_algorithm ), which
addresses this issue, as the second part of this GSoC project.

The PostgreSQL community would benefit from these features, as it
would make available clustering algorithms more powerful than simple
k-means.

Project details
---------------

k-medoids
"""""""""

The first goal of this project is to implement the k-medoids
clustering algorithm. For this, I'll first spend some time studying
the k-means algorithm, as both will probably be pretty similar. This
will also allow me to get familiar with the codebase, the conventions,
the data structures I'll need, etc.

Then I'll implement, test and debug the algorithm. If relevant, I'll
also provide a "k-medoids++" version, which, similarly to the
k-means++ function in MADlib, will chose the initial centroids
depending on the dataset, instead of chosing them randomly. This
allows to detect small clusters located far from the others (which are
usually detected as part of an other bigger cluster using the standard
algorithm).

The final step would be to refactor the code from k-means and
k-medoids to remove any code duplication introduced in this first
part.

OPTICS
""""""

The second part of this project would be to implement the
density-based clustering algorithm OPTICS, which would overcome the
main problem of both the k-means and k-medoids algorithm: non-convex
clusters. This algorithm has been preferred over DBSCAN
(http://en.wikipedia.org/wiki/DBSCAN ) as it is able to detect
clusters of different densities, and, consequently, overlapping
clusters.

I'll first take some time to understand full well the algorithm, and
make a prototype in Python, to be sure I know how it works. Then I'll
actually implement it, test it, and debug it in MADlib.

If, after that, any time's left, I'll consider implementing some
of the improvements of k-means and k-medoids that we can find in the
litterature.

Deliverables
------------

* the k-medoids algorithm in MADlib;
* the OPTICS algorithm, also in MADlib;
* optionnally, some improvements on k-means and/or k-medoids.

Project Schedule
----------------

#. Implementation of the k-medoids algorithm: from 19/05 to 30/05
#. Tests, debug and doc of k-medoids: from 31/05 to 13/06
#. Prototype of OPTICS in Python: from 14/06 to 18/06
#. Actual implementation of OPTICS in MADlib: from 19/06 to 25/06
#. Tests, debug and doc of OPTICS: from 25/06 to 11/06

--
Maxence Ahlouche
06 06 66 97 00

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Tatsuo Ishii
Дата:
Сообщение: logical decoding doc
Следующее
От: Bruce Momjian
Дата:
Сообщение: Re: pg_archivecleanup bug