Re: Update with last known location?

From: James David Smith
Subject: Re: Update with last known location?
Date:
Msg-id: CAMu32ADkZbmtonvVtNGKot1rz-WxADJpY+O=KZhv42mL9T4gdw@mail.gmail.com
In response to: Re: Update with last known location?  (Erik Darling <edarling80@gmail.com>)
Responses: Re: Update with last known location?  (Erik Darling <edarling80@gmail.com>)
List: pgsql-novice
On 29 January 2014 16:02, Erik Darling <edarling80@gmail.com> wrote:
> I would re-suggest using a CTE to contain each dataset to ensure your
> selects are distilling them correctly, and then using a final query to join
> them. You can then either update your data directly through the CTE(s), or
> insert the results to another table to do some further testing. I think
> you'll find this method presents the data a bit more ergonomically for
> analysis.
>
> http://www.postgresql.org/docs/9.3/static/queries-with.html
>
>
>
> On Wed, Jan 29, 2014 at 10:45 AM, James David Smith
> <james.david.smith@gmail.com> wrote:
>>
>> Hi Erik/all,
>>
>> I just tried that, but it's tricky. The 'extra' data is indeed coming
>> from the right side of the join, but it's hard to select only the max
>> from it. Maybe it's possible but I've not managed to do it. Here is
>> where I am, which is so very close.
>>
>> SELECT
>> DISTINCT(a.ppid, a.point_time, a.the_geom) as
>> row_that_needs_geom_updating,
>> max(b.point_time) OVER (PARTITION BY a.ppid, a.point_time) as
>> last_known_position_time
>> FROM
>> test a
>> INNER JOIN
>> (SELECT    ppid,
>> point_time,
>> the_geom
>> FROM    test
>> WHERE    the_geom IS NOT NULL) b
>> ON b.point_time < a.point_time
>> AND a.ppid = b.ppid
>> WHERE a.the_geom IS NULL;
>>
>> If you see attached screen-print, the output is the rows that I want.
>> However I've had to use DISTINCT to stop the duplication. Also I've
>> not managed to pull through 'the_geom' from the JOIN. I'm not sure
>> how. Anyone?
>>
>> But it's kind of working. :-)
>>
>> Worst case if I can't figure out how to solve this in one query I'll
>> have to store the result of the above, and then use it as a basis for
>> another query I think.
>>
>> Thanks
>>
>> James
>>
>>
>>
>> On 29 January 2014 12:56, Erik Darling <edarling80@gmail.com> wrote:
>> > I would try partitioning the second time you call row_number, perhaps by
>> > ID,
>> > and then selecting the MAX() from that, since I think the too much data
>> > you're referring to is coming from the right side of your join.
>> >
>> > On Jan 29, 2014 7:23 AM, "James David Smith"
>> > <james.david.smith@gmail.com>
>> > wrote:
>> >>
>> >> On 28 January 2014 23:15, Gavin Flower <GavinFlower@archidevsys.co.nz>
>> >> wrote:
>> >> > On 29/01/14 11:00, Kevin Grittner wrote:
>> >> >>
>> >> >> James David Smith <james.david.smith@gmail.com> wrote:
>> >> >>
>> >> >>> Given the data is so large I don't want to be taking the data out
>> >> >>> to a CSV or whatever and then loading it back in. I'd like to do
>> >> >>> this within the database using SQL. I thought I would be able to
>> >> >>> do this using a LOOP to be honest.
>> >> >>
>> >> >> I would be amazed if you couldn't do this with  a single UPDATE
>> >> >> statement.  I've generally found declarative forms of such work to
>> >> >> be at least one order of magnitude faster than going to either a PL
>> >> >> or a script approach.  I would start by putting together a SELECT
>> >> >> query using window functions and maybe a CTE or two to list all the
>> >> >> primary keys which need updating and the new values they should
>> >> >> have.  Once that SELECT was looking good, I would put it in the
>> >> >> FROM clause of an UPDATE statement.
>> >> >>
>> >> >> That should work, but if you are updating a large percentage of the
>> >> >> table, I would go one step further before running this against the
>> >> >> production tables.  I would put a LIMIT on the above-mentioned
>> >> >> SELECT of something like 10000 rows, and script a loop that
>> >> >> alternates between the UPDATE and a VACUUM ANALYZE on the table.
>> >> >>
>> >> >> --
>> >> >> Kevin Grittner
>> >> >> EDB: http://www.enterprisedb.com
>> >> >> The Enterprise PostgreSQL Company
>> >> >>
>> >> >>
>> >> > James, you might consider dropping as many indexes on the table as you
>> >> > safely can, and rebuilding them after the mass update. If you have lots
>> >> > of such indexes, you will find this approach to be a lot faster.
>> >> >
>> >> >
>> >> > Cheers,
>> >> > Gavin
>> >>
>> >> Hi all,
>> >>
>> >> Thanks for your help and assistance. I think that window functions, and
>> >> in particular PARTITION BY, are 100% the way to go. I've been
>> >> concentrating on a SELECT statement for now and am close but not quite
>> >> close enough. The query below gets all the data I want, but *too* much.
>> >> What I've essentially done is:
>> >>
>> >> - Select all the rows that don't have any geom information
>> >> - Join them with all rows before this point that *do* have geom
>> >> information.
>> >> - Before doing this join, use partition to generate row numbers.
>> >>
>> >> The attached screen grab shows the result of my query below.
>> >> Unfortunately this is generating a lot of joins that I don't want. This
>> >> won't be practical when doing it with 75,000 people.
>> >>
>> >> Thoughts and code suggestions very much appreciated... if needed I
>> >> could put together some SQL to create an example table?
>> >>
>> >> Thanks
>> >>
>> >> SELECT     row_number() OVER (PARTITION BY test.point_time ORDER BY
>> >> test.point_time) as test_row,
>> >>     test.ppid as test_ppid,
>> >>     test.point_time as test_point_time,
>> >>     test.the_geom as test_the_geom,
>> >>     a.ppid as a_ppid,
>> >>     a.point_time as a_point_time,
>> >>     a.the_geom as a_the_geom,
>> >>     a.a_row
>> >> FROM     test
>> >>     LEFT JOIN (
>> >>         SELECT    the_geom,
>> >>             ppid,
>> >>             point_time,
>> >>             row_number() OVER (ORDER BY ppid, point_time) as a_row
>> >>         FROM     test
>> >>         WHERE     the_geom IS NOT NULL) a
>> >> ON    a.point_time < test.point_time
>> >> AND    a.ppid = test.ppid
>> >> WHERE    test.the_geom IS NULL
>> >> ORDER BY test.point_time;
>> >>


Hi Erik / all,

So I think I've managed to re-write my queries using CTEs. The code below
now gets me the data that I want. But to do so it builds a frankly huge
intermediate result in the part of the SQL that defines the CTE called
'partitioned'. My rough guess is that it will have to materialise around
100 billion rows in order to produce the data I actually need (about 108
million rows).

Could someone please glance through it for me and suggest how to write
it more efficiently?

Thanks

James

WITH missing_geoms AS (
    SELECT ppid,
           point_time,
           the_geom
    FROM   hybrid_location
    WHERE  the_geom IS NULL
),
filled_geoms AS (
    SELECT ppid,
           point_time,
           the_geom
    FROM   hybrid_location
    WHERE  the_geom IS NOT NULL
),
partitioned AS (
    SELECT missing_geoms.ppid,
           missing_geoms.point_time,
           missing_geoms.the_geom,
           filled_geoms.ppid       AS filled_ppid,
           filled_geoms.point_time AS filled_point_time,
           filled_geoms.the_geom   AS filled_geom,
           row_number() OVER (PARTITION BY missing_geoms.ppid,
                                           missing_geoms.point_time
                              ORDER BY     filled_geoms.point_time DESC) AS rn
    FROM      missing_geoms
    LEFT JOIN filled_geoms
           ON filled_geoms.ppid = missing_geoms.ppid
          AND filled_geoms.point_time < missing_geoms.point_time
)
SELECT *
FROM   partitioned
WHERE  rn = 1;
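Incidentally, the pair-generating LEFT JOIN above is exactly what blows up to ~100 billion rows: every missing row is matched with *every* earlier filled row before the window function discards all but one. One way to avoid the self-join entirely is a single window-function pass: a running count of non-NULL geometries assigns each NULL row to the "island" of its last known position, and first_value() then fills the gap. Here is a minimal runnable sketch of that pattern, using Python's built-in SQLite (which also supports these window functions) and an invented miniature hybrid_location table; the SELECT itself should work unchanged in PostgreSQL:

```python
import sqlite3

# Invented miniature version of the hybrid_location table; the table and
# column names mirror the query above, but the rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hybrid_location (ppid TEXT, point_time INTEGER, the_geom TEXT)"
)
conn.executemany(
    "INSERT INTO hybrid_location VALUES (?, ?, ?)",
    [("p1", 1, "g1"), ("p1", 2, None), ("p1", 3, None),
     ("p1", 4, "g4"), ("p1", 5, None),
     ("p2", 1, None), ("p2", 2, "h2")],
)

# count(the_geom) counts only non-NULL geoms, so every run of NULL rows
# shares a group number (grp) with the last known position before it;
# first_value() within that group is the fill value. One scan, no self-join.
filled = conn.execute("""
    SELECT ppid, point_time,
           first_value(the_geom) OVER (PARTITION BY ppid, grp
                                       ORDER BY point_time) AS filled_geom
    FROM (SELECT ppid, point_time, the_geom,
                 count(the_geom) OVER (PARTITION BY ppid
                                       ORDER BY point_time) AS grp
          FROM hybrid_location) s
    ORDER BY ppid, point_time
""").fetchall()

for row in filled:
    print(row)
```

Rows with no earlier known position (p2 at time 1 here) stay NULL, matching the LEFT JOIN behaviour of the query above. Requires SQLite >= 3.25 for window-function support.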

James


In the pgsql-novice list, by date sent:

Previous
From: "Gurpreet Sachdeva -X (gusachde - ARICENT TECHNOLOGIES MAURIITIUS LIMITED at Cisco)"
Date:
Subject: Stored Procedure in Java
Next
From: Erik Darling
Date:
Subject: Re: Update with last known location?