Re: Cursor + upsert (astronomical data)

From: Jiří Nádvorník
Subject: Re: Cursor + upsert (astronomical data)
Date:
Msg-id: 005001cfab55$a0ace910$e206bb30$@gmail.com
In reply to: Cursor + upsert (astronomical data)  (Jiří Nádvorník <nadvornik.ji@gmail.com>)
List: pgsql-performance
Hello Sergey,

Oh dear, I should have written to you months ago :). I actually did exactly what you suggest - it was actually
Markus Demleitner's (GAVO org.) idea. The GROUP BY on ipixcenter indeed runs in less than an hour on the whole table and
is linear by definition, so no problem with memory. I used bit depth 11, I think, which means that when I treat the ipix
as a square with the corresponding number of steradians (one of your functions, I don't remember which), it has a side of
about 0.7 arcsec and a "half-diameter" of about 1.1 arcsec, which means it approximately covers a circle of 1 arcsec
radius centered on the square's center - if you understand what I mean.
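
To make the grouping step concrete, it can be sketched roughly like this (a sketch only - the table and column names are taken from my script below, and the q3c_ipixcenter(ra, dec, bits) signature is my assumption based on this thread):

```sql
-- Sketch: collapse ~3e8 observations into ~7e7 candidate "squares"
-- at bit depth 11 (side ~0.7 arcsec). Runs as a single linear pass.
SELECT q3c_ipixcenter(raj2000, dej2000, 11) AS ipix,
       avg(raj2000) AS raj2000,   -- rough center of the square's members
       avg(dej2000) AS dej2000,
       count(*)     AS nobs
FROM observation
GROUP BY q3c_ipixcenter(raj2000, dej2000, 11);
```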

I actually presented this approach and its quite good results at the IVOA Interop in Madrid this year (don't be afraid,
I pointed out that I use your Q3C plugin for PostgreSQL).

The problem actually comes when you have more than 2 of these ipix "squares" in a line - when joining them, you don't
know which ones to use. It causes about 22% of these joined ipix centers to have collisions - when I take the resulting
ipix (joined from the ipix centers) and look back at all the original observations from which these ipix centers were
constructed, they don't belong to the same object. It is because when you work with the ipix centers instead of the
original observations, in the most pathological case you shift the observations from the bottom left corner to the top
upper corner when joining with a square to your upper left. That means a shift of up to 2.2 arcsec which you can't backtrack.

But it definitely brought us closer to our goal - instead of 3e8 observations you have about 7e7. That's
still too much to process with a quadratic algorithm, but it's something.

Maybe I could use it along with my *naïve* sequential algorithm. Could you look at the attached script and see if you
spot any obviously inefficient statements?

Thank you very much! You know, even talking with someone about the problem brings you closer to the solution.

P.S.: I am sorry if I got a little confused with your name when addressing you - are you the co-author of the Q3C paper
where the name is S. Koposov (+ O. Bartunov)?

Best Regards,

Jiri Nadvornik

-----Original Message-----
From: Sergey Karpov [mailto:karpov.sv@gmail.com]
Sent: Monday, July 28, 2014 5:03 PM
To: Oleg Bartunov
Cc: Jiří Nádvorník; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Cursor + upsert (astronomical data)

Hi Jiri,

I understand your problem (and I actually have exactly the same one in my sky monitoring experiment). Unfortunately, I
have no complete solution for it as of now.

I just may suggest that you look at the q3c_ipixcenter() function from Q3C, which returns the Q3C bin number for given
coordinates at a given binning depth. Probably you may first group your observations using this function, selecting
the binning depth to roughly match your desired angular resolution, and then merge the (much shorter) list of
resulting bin numbers (the sort of common identifier you are looking for) to handle corner cases (when the same
cluster of observational points lies between two bins). I did not try this approach myself yet, but it will probably
be significantly faster.

Good luck!

Sergey

2014-07-27 20:56 GMT+04:00 Oleg Bartunov <obartunov@gmail.com>:
> Jiri,
>
> as I understand, your problem is called crossmatch?  I attach a pdf of
> our work in progress, where we compared several spatial indexing
> techniques, including PostGIS, Q3C and the new pgsphere.  Sergey Karpov
> made these benchmarks.
>
> New pgsphere you may find here
> https://github.com/akorotkov/pgsphere
>
> Oleg
>
> On Sat, Jul 26, 2014 at 1:46 PM, Jiří Nádvorník <nadvornik.ji@gmail.com> wrote:
>> Hello guys.
>>
>>
>>
>> My issue kind of hits multiple topics, but the main question is about
>> performance. I think you need to understand the background a little
>> bit to be able to help me. So I will first define the problem and
>> my solutions to it, and place the questions for you at the end of this message.
>>
>>
>>
>> Problem:
>>
>> I have a table of observations of objects on the sky. The most
>> important columns are the coordinates (x,y). All other columns in
>> there is just additional information about the observation. The
>> problem is that when I take an image of the same object on the sky
>> twice, the coordinates x,y won’t be the same; they will only be close
>> to each other. My task is to generate a common identifier for all of
>> the observations of the same object and assign the observations to it
>> (N:1 relation). The criterion is that all of the observations which
>> are within 1 arcsec of this identifier are considered the same
>> object. I keep the identifiers in a separate table (objcat) and have a foreign key in the observations table.
>>
>> The reason why I address the performance issues here is that the table
>> of observations currently has about 3e8 rows after 1.5 years of
>> gathering data. The growth is linear.
>>
>>
>>
>> Technical background:
>>
>> I’m trying to keep the whole algorithm in the DB if possible, because I
>> have a good PostgreSQL plugin, Q3C, for indexing the coordinates of the
>> objects on the sky
>> (https://code.google.com/p/q3c/source/browse/README.q3c). It also has
>> quite a few stored procedures for looking up near neighbors in that
>> table, which is what I’m doing. The function q3c_join(ra1, dec1, ra2,
>> dec2, radius) returns true if object 2 is within radius of object 1. It
>> simply generates a list of index bitmap OR queries with the operators
>> <=, >= which define the position on the sky. Asking for the nearest neighbors is then only index scans.
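>>
>> To illustrate (a sketch with a made-up search position; q3c_join and
>> q3c_ang2ipix are the Q3C functions used in the script below):
>>
>> ```sql
>> -- Find all observations within 1 arcsec of a given position.
>> -- q3c_join(ra1, dec1, ra2, dec2, radius) is index-assisted when the
>> -- table has an index on q3c_ang2ipix(raj2000, dej2000).
>> SELECT o.*
>> FROM observation AS o
>> WHERE q3c_join(215.31, 31.07,          -- search center (ra, dec), degrees
>>                o.raj2000, o.dej2000,
>>                1.0 / 3600);            -- 1 arcsec radius, in degrees
>> ```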
>>
>>
>>
>> Solution:
>>
>> After a lot of experimentation with clustering the objects and trying
>> to process them all together in one “silver bullet” SQL query, I
>> decided to use a simpler approach. The main problem with the
>> “process all at once” approach is that finding the neighbours of
>> each observation is by definition quadratic, and for 3e8 rows it just
>> runs out of disk space (~TBs of storage for the temporary results).
>>
>> The simplest approach I could think of is to process each of the
>> 3e8 rows sequentially and ask:
>>
>> Do I have an identifier within a radius of 1 arcsec?
>>
>> No: Generate one and assign me to it.
>>
>> Yes: Update it and assign me to it. The update is done as a weighted
>> average – I keep a count of how many observations the identifier
>> has been computed from. The result is that the identifier will have
>> the average coordinates of all the observations it identifies – it will be their center.
>>
>>
>>
>> So here I come with my procedure. It has 3 params. The first two are
>> the range of oids to process in the table, used for scaling and
>> parallelization of the algorithm. The third is the radius in which to search for the neighbours.
>>
>>
>>
>> DROP TYPE IF EXISTS coords;
>>
>> CREATE TYPE coords AS (
>>     raj2000 double precision,
>>     dej2000 double precision
>> );
>>
>> DROP FUNCTION IF EXISTS build_catalog(int, int, double precision);
>>
>> CREATE OR REPLACE FUNCTION build_catalog(fromOID int, toOID int, radius double precision)
>>     RETURNS VOID AS $$
>> DECLARE
>>     cur1 CURSOR FOR
>>         SELECT raj2000, dej2000
>>         FROM \schema.observation AS obs
>>         WHERE obs.oid >= fromOID
>>           AND obs.oid < toOID;
>>     curr_raj2000 double precision;
>>     curr_dej2000 double precision;
>>     curr_coords_cat coords;
>>     cnt int;
>> BEGIN
>>     /*SELECT current_setting('transaction_isolation') into tmp;
>>     raise notice 'Isolation level %', tmp;*/
>>     OPEN cur1;
>>     cnt := 0;
>>     LOCK TABLE \schema.objcat IN SHARE ROW EXCLUSIVE MODE;
>>     LOOP
>>         FETCH cur1 INTO curr_raj2000, curr_dej2000;
>>         EXIT WHEN NOT found;
>>
>>         WITH upsert AS (
>>             UPDATE \schema.objcat
>>             SET ipix_cat = q3c_ang2ipix(
>>                     (raj2000 * weight + curr_raj2000) / (weight + 1),
>>                     (dej2000 * weight + curr_dej2000) / (weight + 1)),
>>                 raj2000 = (raj2000 * weight + curr_raj2000) / (weight + 1),
>>                 dej2000 = (dej2000 * weight + curr_dej2000) / (weight + 1),
>>                 weight = weight + 1
>>             WHERE q3c_join(curr_raj2000, curr_dej2000,
>>                            raj2000, dej2000,
>>                            radius)
>>             RETURNING *),
>>         ins AS (
>>             INSERT INTO \schema.objcat
>>                 (ipix_cat, raj2000, dej2000, weight)
>>             SELECT q3c_ang2ipix(curr_raj2000, curr_dej2000),
>>                    curr_raj2000,
>>                    curr_dej2000,
>>                    1
>>             WHERE NOT EXISTS (SELECT * FROM upsert)
>>             RETURNING *)
>>         UPDATE \schema.observation
>>         SET id_cat = (SELECT DISTINCT id_cat FROM upsert
>>                       UNION
>>                       SELECT id_cat FROM ins
>>                       WHERE id_cat IS NOT NULL
>>                       LIMIT 1)
>>         WHERE CURRENT OF cur1;
>>
>>         cnt := cnt + 1;
>>
>>         IF ((cnt % 100000) = 0) THEN
>>             RAISE NOTICE 'Processed % entries', cnt;
>>         END IF;
>>
>>     END LOOP;
>>
>>     CLOSE cur1;
>> END;
>>
>> $$ LANGUAGE plpgsql;
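>>
>> (For reference, a call for one chunk of oids would look like this; the
>> radius is in degrees, matching q3c_join above:)
>>
>> ```sql
>> -- Process oids [0, 1000000) with a 1 arcsec match radius.
>> SELECT build_catalog(0, 1000000, 1.0 / 3600);
>> ```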
>>
>>
>>
>> Results: When I run the query only once (1 client thread), it processes
>> about 1 million rows per hour, which means days for the whole dataset.
>> When I run it in parallel with that lock to ensure pessimistic
>> synchronization, it effectively runs sequentially too :) - the other
>> threads just wait. When I remove that lock and hope to resolve the
>> resulting conflicts later, the SSD serves up to 4 threads relatively
>> effectively - which can divide my days of runtime by 4 - still unacceptable.
>>
>>
>>
>> The reason is quite clear here - I'm writing something to a table in
>> one cycle of the script, and then in the following cycle I need
>> to read that information back.
>>
>>
>>
>> Questions for you:
>>
>> 1.       The first question is whether you can think of a better way to do
>> this, and whether SQL is even capable of such a thing - or do I
>> have to do it in C? Would rewriting the SQL function in C help?
>>
>> 2.       Could I somehow bend the committing during the algorithm to my
>> needs? Ensure that within one cycle, the relevant part of the
>> identifiers table is kept in memory for faster lookups?
>>
>> 3.       Why is this so slow? :) It is comparable to the quadratic algorithm
>> in terms of speed - it just does not use any memory.
>>
>>
>>
>> I tried to sum this up as best I could - for more information please
>> don't hesitate to contact me.
>>
>>
>>
>> Thank you very much for even reading this far.
>>
>>
>>
>> Best Regards,
>>
>>
>>
>> Jiri Nadvornik
>>
>>
>>
>>

Attachments
