GSoC 2014: Implementing clustering algorithms in MADlib

Поиск
Список
Период
Сортировка
От Maxence Ahlouche
Тема GSoC 2014: Implementing clustering algorithms in MADlib
Дата
Msg-id CAJeaomWrwq9fOdbs5kvTvQJh9tCsMKO9QuYMw7pbDUzPEANsdQ@mail.gmail.com
обсуждение исходный текст
Ответы Re: GSoC 2014: Implementing clustering algorithms in MADlib  (Hai Qian <hqian@gopivotal.com>)
Список pgsql-hackers
Hi all,

2014-04-03 9:04 GMT+02:00 Hai Qian <hqian@gopivotal.com>:
Hi Maxence,

We are very glad that you are enthusiastic about MADlib.Your proposal looks
very nice. The algorithms that you proposed will be very useful, and can
serve as a good start. After learning about the C++ abstraction layer and
testing suite, you should be able to quickly grasp how to program for
MADlib. We can fill in the details of the plan later. The conflict in your
time schedule shouldn't be a problem.

It would be nice if you could provide a little bit more introduction to
these algorithms and their importance in the proposal.

Overall it is very good. And thank you again

Hai


As you requested, I've described the clustering algorithms on my github [0]. I'm copying it at the end of this email, just in case.
In the case of OPTICS, I still need some time to grasp the details of how it works, so I've only included a basic description of this one.

Regards,
Maxence

[0] https://github.com/viodlen/gsoc_2014/blob/master/algorithm_details.rst

Details about the different clustering algorithms
=================================================

This file describes the k-means algorithm, which is currently the only
clustering algorithm implemented in MADlib, and the k-medoids and
OPTICS algorithm, which I plan to implement this summer, as a GSoC
project. I'll also explain the DBSCAN algorithm, as OPTICS is only an
extension of this one.

The goal of all these algorithms is to determine clusters in a set of
points. While this problem is NP-hard, it can be solved efficiently
thanks to these algorithms, even though the result is usually not
perfect.

I've tried not to add too many details, and taken some shortcuts, so
there may be some inaccuracies. I wanted to keep this readable (not
sure that goal was achieved, though), but if anyone wants more
precisions, I'll gladly add them.

k-means
-------

The k-means algorithm is the one implemented in MADlib. It selects
*k* points, either randomly, among the dataset, or set by the user,
to use as initial centroids (center of clusters). *k* must be
defined by the user; the algorithm is unable to guess the number of
clusters.

All the points in the dataset are then affected to the nearest
centroid, thus making *k* clusters.

The next step is to compute the new centroids as a "mean" of all the
points in a cluster. Then reassign all the points to then new
centroids, and start all over again until there is no change in
cluster assignation.

This algorithm is usually pretty fast (even though it can be made to
take an exponential time to converge). Still, it is easy to get a
wrong result because of local convergence, e.g. having one cluster
split in two parts by the algorithm, or two clusters merged. It is
also pretty sensitive to outliers (points that don't obviously belong
to any cluster), and the final result depend greatly on the initial
centroids.

This algorithm doesn't work well with non-euclidean distance, in
general.

Another think to note is that k-means will result in Voronoi cell
shaped clusters, which is not always what we want.

k-medoids
---------

The k-medoids algorithm works the same way as k-means, save for a few
exceptions.

With k-medoids, the centroids are always points of the dataset (and
are then called medoids). The new medoids are the points in clusters
which minimize the sum of pairwise dissimilarities (in other terms,
the point for which the average distance to other points in the
cluster is minimal). This makes the algorithm less sensitive to
outliers than k-means.

This algorithm is computationnally more intensive than k-means, but
still fast.

As for k-means, it is possible to run the algorithm several times with
different initial centroids, and get the best result (i.e. the one
that minimizes the sum of distances from points to their centroids).

DBSCAN
-------

DBSCAN (*Density-Based Spatial Clustering of Applications with Noise*)
is a clustering algorithm based on the density of clusters.

This adresses several limitations of k-means and k-medoids: it does
not assign a cluster to outliers, and allows the detection of
weird-shaped clusters. Moreover, it doesn't need to be told the number
of clusters.

A point is called dense if enough other points are near enough from
it, where "enough other points" and "near enough" are defined by the
parameters *min_points* and *epsilon*.

*min_points* therefore represents the minimum number of points
required to make a cluster, and *epsilon* is the maximum distance a
point can be from a cluster to be considered part of this cluster.

For every unassigned point, count his *epsilon*-neighbours (not sure
this term exists, but I'll use it for convenience). If there are too
few, consider this point an outlier; else, create a new cluster and
put the current point in it, along with its neighbours, the neighbours
of its neighbours, and so on.

The main problem with DBSCAN is that it doesn't work well for clusters
with very different densities.

OPTICS
------

OPTICS (*Ordering Points to Identify the Clustering Structure*)
addresses this issue.

The first step in OPTICS is to order the points so that two points
that are near each other will be in nearby positions, and store the
smallest reachability distance of each point (i.e. the distance at
which the point would be considered dense by DBSCAN).

The next step is to determine the clusters from the data obtained at
the previous step. This can be done visually: by plotting the ordered
points on *x* and their smallest reachability distance on *y*, the
clusters are the valleys formed by the curve.

I'll add more details to this section when I understand better how all
of this works.

Bibliography
------------
http://en.wikipedia.org : where clustering algorithms are well
detailed

http://www.stat.cmu.edu/~ryantibs/datamining/lectures/04-clus1-marked.pdf
: introduction to k-means and k-medoids

http://www.vitavonni.de/blog/201211/2012110201-dbscan-and-optics-clustering.html
: short introduction to DBSCAN and OPTICS

http://www.cs.uiuc.edu/class/fa05/cs591han/papers/ankerst99.pdf :
original paper presenting OPTICS


--
Maxence Ahlouche
06 06 66 97 00

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Andres Freund
Дата:
Сообщение: Re: Windows exit code 128 ... it's baaack
Следующее
От: Tom Lane
Дата:
Сообщение: Re: [bug fix] pg_ctl always uses the same event source