Cursor + upsert (astronomical data)

Поиск
Список
Период
Сортировка
От Jiří Nádvorník
Тема Cursor + upsert (astronomical data)
Дата
Msg-id 01d201cfa8be$dd746880$985d3980$@gmail.com
обсуждение исходный текст
Ответы Re: Cursor + upsert (astronomical data)  (Vitalii Tymchyshyn <vit@tym.im>)
Re: Cursor + upsert (astronomical data)  (Jeff Janes <jeff.janes@gmail.com>)
Список pgsql-performance

Hello guys.

 

My issue kind of hits multiple topics, but the main question is about performance. I think you need to understand the background a little bit to be able to help me. So I will firstly define the problem and my solutions to it and place the questions for you to the end of this message.

 

Problem:

I have a table of observations of objects on the sky. The most important columns are the coordinates (x,y). All other columns in there are just additional information about the observation. The problem is that when I take an image of the same object on the sky twice, the coordinates x,y won’t be the same, they will be only close to each other. My task is to generate a common identifier to all of the observations of the same object and assign the observations to it (N:1 relation). The criterium is that all of the observations which are within 1 arcsec of this identifier are considered as the same object. I keep the identifiers in a separate table (objcat) and have a foreign key in the observations table.

The reason why I solve the performance issues here is that the table of observations has atm cca 3e8 rows after 1.5 year of gathering the data. The number growth is linear.

 

Technical background:

I’m trying to keep the whole algoritm in DB if possible because I have a good PostgreSQL plugin Q3C for indexing the coordinates of the objects on the sky (https://code.google.com/p/q3c/source/browse/README.q3c). It also has quite a few stored procedures to look in that table for near neighbors which is what I’m doing. The function q3c_join(x1, x2, y1, y2, radius) returns true if the object y is within radius of the object x. It simply generates a list of index bitmap or queries with the operators <=, >= which define the position on sky. Asking for the nearest neighbors are then only index scans.

 

Solution:

After lot of experimentation with clustering the objects and trying to process them all together in one “silver-bullet” SQL query I decided to use some more simple approach. The main problem with the “process all at once approach” is that the finding neighbours for each observation is in definition quadratic and for 3e8 rows just runs out of disk space (~TBs of memory for the temporary results).

The simplest approach I could think of is that I process each row of the 3e8 rows sequentially and ask:

Do I have an identifier in the radius of 1 arcsec?

No: Generate one and assign me to it.

Yes: Update it and assigne me to it. The update is done as weighted average – I keep the number of how many observations the identifier has been computed. The result is that the identifier will have average coordinates of all the observations it identifies – it will be the center.

 

So here I come with my procedure. It has 3 params. The first two are range of oids to list in the table. Used for scaling and parallelization of the algorithm. The third is the radius in which to search for the neighbours.

 

DROP TYPE IF EXISTS coords;

CREATE TYPE coords AS (

                raj2000 double precision,

                dej2000 double precision

                );

 

 

DROP FUNCTION IF EXISTS build_catalog(int,int,double precision);

CREATE OR REPLACE FUNCTION build_catalog (fromOID int, toOID int, radius double precision)

                RETURNS VOID AS $$

DECLARE

                cur1 CURSOR FOR

                               SELECT

                                               raj2000, dej2000

                               FROM

                                               \schema.observation AS obs

                               WHERE

                                               obs.oid >= fromOID

                               AND

                                               obs.oid < toOID;

                curr_raj2000 double precision;

                curr_dej2000 double precision;

                curr_coords_cat coords;

                cnt int;       

 

BEGIN

/*SELECT current_setting('transaction_isolation') into tmp;

raise notice 'Isolation level %', tmp;*/

OPEN cur1;

cnt:=0;

LOCK TABLE \schema.objcat IN SHARE ROW EXCLUSIVE MODE;

LOOP

                FETCH cur1 INTO curr_raj2000, curr_dej2000;

                               EXIT WHEN NOT found;

               

                WITH

                               upsert

                AS

                               (UPDATE

                                               \schema.objcat

                               SET

                                               ipix_cat=q3c_ang2ipix(

                                                               (raj2000 * weight + curr_raj2000) / (weight + 1),

                                                               (dej2000 * weight + curr_dej2000) / (weight + 1)

                                               ),

                                               raj2000 = (raj2000 * weight + curr_raj2000) / (weight + 1),

                                               dej2000 = (dej2000 * weight + curr_dej2000) / (weight + 1),

                                               weight=weight+1

                               WHERE

                                               q3c_join(curr_raj2000, curr_dej2000,

                                                               raj2000, dej2000,

                                                               radius)

                               RETURNING *),

                ins AS

                (INSERT INTO

                                               \schema.objcat

                                               (ipix_cat, raj2000, dej2000, weight)

                               SELECT

                                               (q3c_ang2ipix(curr_raj2000, curr_dej2000)),

                                               curr_raj2000,

                                               curr_dej2000,

                                               1

                WHERE NOT EXISTS

                               (SELECT * FROM upsert)

                RETURNING *)

                UPDATE

                               \schema.observation

                SET

                               id_cat = (SELECT DISTINCT

                                                               id_cat

                                               FROM

                                                               upsert

                                               UNION

                                               SELECT

                                                               id_cat

                                               FROM

                                                               ins

                                               WHERE id_cat IS NOT NULL

                                               LIMIT 1)

                WHERE CURRENT OF cur1;

                cnt:=cnt+1;

 

                IF ((cnt % 100000 ) = 0) THEN

                               RAISE NOTICE 'Processed % entries', cnt;

                END IF;

                              

END LOOP;

CLOSE cur1;

END;

$$ LANGUAGE plpgsql;

 

Results: When I run the query only once (1 client thread), it runs cca 1 mil rows per hour. Which is days for the whole dataset. When I run it in parallel with that lock to ensure pessimistic synchronization, it runs sequentially too J the other threads just waiting. When I delete that lock and hope to solve the resulting conflicts later, the ssd disk serves up to 4 threads relatively effectively – which can divide my days of time by 4 – still inacceptable.

 

The reason is quite clear here – I’m trying to write something in one cycle of the script to a table – then in the following cycle I need to read that information.

 

Questions for you:

1.       The first question is if you can think of a better way how to do this and maybe if SQL is even capable of doing such thing – or do I have to do it in C? Would rewriting the SQL function to C help?

2.       Could I somehow bend the commiting during the algorithm for my thing? Ensure that inside one cycle, the whole part of the identifiers table would be kept in memory for faster lookups?

3.       Why is this so slow? J It is comparable to the quadratic algorithm in the terms of speed – only does not use any memory.

 

I tried to sum this up the best I could – for more information please don’t hesitate to contact me.

 

Thank you very much for even reading this far.

 

Best Regards,

 

Jiri Nadvornik

 

 

В списке pgsql-performance по дате отправления:

Предыдущее
От: Reza Taheri
Дата:
Сообщение: Re: High rate of transaction failure with the Serializable Isolation Level
Следующее
От: Reza Taheri
Дата:
Сообщение: Re: High rate of transaction failure with the Serializable Isolation Level