Re: Cursor + upsert (astronomical data)

Поиск
Список
Период
Сортировка
От Craig James
Тема Re: Cursor + upsert (astronomical data)
Дата
Msg-id CAFwQ8reKBGki40i9BkB=PDj5F79NQ6rB1P=P9yvQqM_DsLU7gg@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Cursor + upsert (astronomical data)  (Jiří Nádvorník <nadvornik.ji@gmail.com>)
Список pgsql-performance
Hi Jiri,

I’m really interested in those [clustering] algorithms and study them. But I would need somebody to point me directly at a specific algorithm to look at. The main problem with choosing the right one (which couldn’t get over even my university teacher) is that you don’t know the number of clusters (classification problem) and you don’t know the number of objects in one cluster (that isn’t such a big deal), but each cluster has different count of objects – which is a big issue because it actually disqualifies all of k-means based algorithms (correct me if I’m wrong).


I'm not an expert in clustering, I just worked with several people who developed clustering packages.

One algorithm that's widely used in chemistry and genetics is the Jarvis-Patrick algorithm. Google for "jarvis patrick clustering" and you'll find several good descriptions. It has the advantages that it's deterministic, it works with any distance metric, and it chooses the number of clusters (you don't have to specify ahead of time). The drawback to Jarvis-Patrick clustering is that you have to start with a nearest-neighbors list (i.e. for each item in your data set, you must find the nearest N items, where N is something like 10-20.) In a brute-force approach, finding nearest neighbors is O(N^2) which is bad, but if you have any method of partitioning observations to narrow down the range of neighbors that have to be examined, the running time can be dramatically reduced.

One advantage of JP clustering is that once you have the nearest-neighbors list (which is the time-consuming part), the actual JP clustering is fast. You can spend some time up front computing the nearest-neighbors list, and then run JP clustering a number of time with different parameters until you get the clusters you want.

A problem with J-P clustering in your situation is that it will tend to merge clusters. If you have two astronomical objects that have overlapping observations, they'll probable merge into one. You might be able to address this with a post-processing step that identified "bimodal" or "multimodal" clusters -- say by identifying a cluster with a distinctly asymmetrical or eliptical outline and splitting it. But I think any good clustering package would have trouble with these cases.

There are many other clustering algorithms, but many of them suffer from being non-deterministic, or from requiring a number-of-clusters parameter at the start.

That pretty much exhausts my knowledge of clustering. I hope this gives you a little insight.

Craig


В списке pgsql-performance по дате отправления:

Предыдущее
От: Jiří Nádvorník
Дата:
Сообщение: Re: Cursor + upsert (astronomical data)
Следующее
От: Jeff Janes
Дата:
Сообщение: Re: Very slow planning performance on partition table