Re: Cursor + upsert (astronomical data)

From: Craig James
Subject: Re: Cursor + upsert (astronomical data)
Date:
Msg-id: CAFwQ8rfW3B9JHqQnbQHYjdez=qOKF6a4Lc3pw=p3_ZQ6WKpitw@mail.gmail.com
In response to: Re: Cursor + upsert (astronomical data) (Jiří Nádvorník <nadvornik.ji@gmail.com>)
Responses: Re: Cursor + upsert (astronomical data)
Re: Cursor + upsert (astronomical data)
List: pgsql-performance
Jiri,

If you haven't looked at clustering algorithms yet, you might want to do so. Your problem is a special case of clustering, where you have a large number of small clusters.  A good place to start is the overview on Wikipedia: http://en.wikipedia.org/wiki/Cluster_analysis

A lot of people have worked extensively on this problem, and you might find a good solution, or at least some ideas to guide your own algorithm. In my field (chemistry), researchers often need to cluster 10^6 to 10^7 chemical compounds, and a great deal of research has gone into efficient ways to do so.
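[Editorial note: as an illustration of the suggestion above, here is a minimal sketch of such a clustering pass using scikit-learn's DBSCAN. The library choice, the toy data, and the assumption that coordinates are already projected onto a locally flat plane in arcseconds are all illustrative, not from the thread.]

import numpy as np
from sklearn.cluster import DBSCAN

# Toy observations in a locally flat projection (units: arcsec).
coords = np.array([
    [0.0, 0.0], [0.3, 0.1], [0.1, -0.2],   # one tight group
    [5.0, 5.0], [5.2, 4.9],                # a second group
    [20.0, 20.0],                          # an isolated observation
])

# eps is the 1 arcsec matching radius; min_samples=1 makes every
# observation join some cluster (isolated points become singletons).
labels = DBSCAN(eps=1.0, min_samples=1).fit_predict(coords)

# One reference position per cluster: the mean of its members.
for k in np.unique(labels):
    print(k, coords[labels == k].mean(axis=0))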

Craig



On Sun, Jul 27, 2014 at 7:35 AM, Jiří Nádvorník <nadvornik.ji@gmail.com> wrote:

Hi Vitalii, thank you for your reply.

 

The most pathological case of the problem you describe is when these observations lie on a single line. As you suggest, B would be in the middle, so A and C are not within 1 arcsec of each other, but they must each be within 1 arcsec of their common average coordinates. If A and B, and B and C, are each 1 arcsec apart, the right solution is to pick B as the reference identifier and assign A and C to it.
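[Editorial note: a tiny numeric check of this pathological case; the coordinates are illustrative, in arcseconds.]

import math

A, B, C = (0.0, 0.0), (1.0, 0.0), (2.0, 0.0)  # collinear, 1 arcsec apart

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

print(dist(A, B), dist(B, C))  # 1.0 1.0 -> each within the 1 arcsec limit
print(dist(A, C))              # 2.0     -> A and C are NOT within 1 arcsec
# B is the average of A and C, so choosing B as the reference
# identifier puts both A and C within 1 arcsec of it.
print(dist(A, B) <= 1.0 and dist(C, B) <= 1.0)  # True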

 

We already tried the approach you suggest, applying a grid based on the Q3C indexing of the database. Instead of simply rounding the coordinates, we used the center of the Q3C “square” in which the observation took place. The results were poor, however: 22% of the identifiers were closer than 1 arcsec to each other. That means that when you cross-match the original observations against them, you don’t know which identifier to use and you get duplicates. The reason is that nearly all of the observations come from the SMC (Small Magellanic Cloud), where the density of observations is very high, so you get more than two “rounded” positions in a row and don’t know which ones to join together (i.e., which to average into common coordinates). If that is not clear enough, I can draw an image for you.

Maybe simple rounding would give better results, because the Q3C squares are not all the same size and can only be scaled by factors of 2 (a square twice as small or twice as large). We used a square with a side of about 0.76 arcsec, which approximately covers the 1 arcsec radius circle.
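[Editorial note: a small sketch of the failure mode described above, with a flat grid standing in for the Q3C squares; only the ~0.76 arcsec cell size is taken from the mail.]

import math

CELL = 0.76  # grid cell side in arcsec, as in the mail

def cell_center(x, y):
    """Snap a coordinate to the center of its grid cell."""
    return ((math.floor(x / CELL) + 0.5) * CELL,
            (math.floor(y / CELL) + 0.5) * CELL)

# Two observations of the same source, only 0.1 arcsec apart,
# fall into adjacent cells...
a = cell_center(0.70, 0.0)   # -> (0.38, 0.38)
b = cell_center(0.80, 0.0)   # -> (1.14, 0.38)

# ...and their cell-center "identifiers" end up only 0.76 arcsec
# apart, closer than the 1 arcsec radius, so a later cross-match
# cannot decide between them (the duplicate problem above).
print(math.hypot(a[0] - b[0], a[1] - b[1]))  # 0.76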

 

Oh, and one more important thing. The difficulty of our data is not that it is 3e8 rows, but that in the highest-density regions about 1000 images overlap. That kills you when you try to self-join the observations to find neighbours for each of them: the quadratic complexity is driven by the overlap on the image (e.g., 10000 observations on one image with another 999 images overlapping it means 10000 * 1000^2 comparisons).
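[Editorial note: the back-of-the-envelope count from the example above.]

obs_per_image = 10_000       # observations on one image
overlap = 1_000              # images overlapping the dense field
print(obs_per_image * overlap**2)  # 10000000000, i.e. ~1e10 comparisons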

 

Best regards,

 

Jiri Nadvornik

 

From: tivv00@gmail.com [mailto:tivv00@gmail.com] On Behalf Of Vitalii Tymchyshyn
Sent: Sunday, July 27, 2014 8:06 AM
To: Jiří Nádvorník
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Cursor + upsert (astronomical data)

 

I am not sure I understand the problem fully, e.g. what to do if there are observations A, B, and C with A to B and B to C less than the threshold, and A to C over the threshold, but anyway.

Could you first apply a kind of grid to your observations? What I mean is to round your coordinates to, say, 1/2 arcsec on each axis and group the results. I think most of your observations will be grouped this way, and you can then use your regular algorithm to combine the results.
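[Editorial note: a minimal sketch of this gridding idea in plain Python, with a hypothetical list of (ra, dec) offsets in arcseconds; only the 1/2 arcsec cell size comes from the suggestion.]

from collections import defaultdict

CELL = 0.5  # arcsec, the suggested grid step per axis

observations = [(0.10, 0.20), (0.15, 0.18), (0.90, 0.95)]  # illustrative

groups = defaultdict(list)
for ra, dec in observations:
    key = (round(ra / CELL), round(dec / CELL))  # grid cell index
    groups[key].append((ra, dec))

# Most observations collapse into per-cell groups; each group is then
# averaged, and the regular (slower) algorithm only has to combine
# these averaged positions.
for key, members in groups.items():
    n = len(members)
    print(key, (sum(r for r, d in members) / n,
                sum(d for r, d in members) / n))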

Best regards, Vitalii Tymchyshyn




--
---------------------------------
Craig A. James
Chief Technology Officer
eMolecules, Inc.
---------------------------------
