Re: Update with last known location?

From: Erik Darling
Subject: Re: Update with last known location?
Date:
Msg-id: CAO+EYw+XX=Lye3Vwv2uGBLQsrSE4Le8OdfJh-XnMjtg9oTO3oQ@mail.gmail.com
In reply to: Re: Update with last known location?  (James David Smith <james.david.smith@gmail.com>)
List: pgsql-novice
That solution looks like it totally works. Cool.
 
I was running it against a Redshift instance (the only thing available to me at work), which doesn't support correlated subqueries like that. That's why I was playing with LAG() instead.
 
An error occurred when executing the SQL command:
SELECT
   data.ppid,
   data.point_time,
   CASE
     WHEN data.the_geom IS NULL
     THEN (
       --Get all locations ...
ERROR: This type of correlated subquery pattern is not supported yet [SQL State=0A000]
Execution time: 0.09s
1 statement(s) failed.
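[Editorial note: one way around the unsupported correlated subquery is to avoid it entirely with window functions. The sketch below is untested against Redshift and uses the test_data table defined downthread; it first counts the non-NULL geoms seen so far per person, which labels each NULL row with the group of its last fix, then propagates the one non-NULL geom within each group:]

```sql
WITH grouped AS (
    SELECT ppid, point_time, the_geom,
           -- COUNT(the_geom) ignores NULLs, so the running count only
           -- ticks up on rows that have a fix; NULL rows inherit the
           -- group number of the last fix before them.
           COUNT(the_geom) OVER (PARTITION BY ppid
                                 ORDER BY point_time
                                 ROWS UNBOUNDED PRECEDING) AS grp
    FROM test_data
)
SELECT ppid, point_time,
       -- each (ppid, grp) group contains at most one non-NULL geom,
       -- so MAX simply carries it over the NULL rows in the group
       MAX(the_geom) OVER (PARTITION BY ppid, grp) AS the_geom
FROM grouped
ORDER BY ppid, point_time;
```

[If the Redshift version in question supports IGNORE NULLS in LAST_VALUE/LAG, that would express the same fill-forward more directly; the COUNT trick above has the advantage of also running on stock PostgreSQL.]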
 


On Thu, Jan 30, 2014 at 10:28 AM, James David Smith <james.david.smith@gmail.com> wrote:
You're a star Erik. Thanks again. I'll see how this goes.

I posted the question on stack-exchange earlier too by the way and
someone replied with the below code. So I'm going to try both. Cheers
again.

SELECT
  data.ppid,
  data.point_time,
  CASE
    WHEN data.the_geom IS NULL
    THEN (
      --Get all locations with an earlier time stamp for that ppid
      SELECT geom.the_geom
      FROM test_data geom
      WHERE data.ppid = geom.ppid
      AND geom.point_time < data.point_time
      AND geom.the_geom IS NOT NULL
      AND NOT EXISTS (
        -- Cull all but the most recent one
        SELECT *
        FROM test_data cull
        WHERE cull.ppid = geom.ppid
        AND geom.the_geom IS NOT NULL
        AND cull.point_time < data.point_time
        AND cull.point_time > geom.point_time
        AND cull.the_geom IS NOT NULL
        )
    )
  ELSE data.the_geom
  END AS the_geom
FROM test_data data;
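[Editorial note: on PostgreSQL, the NOT EXISTS cull in the query above can be collapsed into an ORDER BY ... LIMIT 1 inside the scalar subquery; an equivalent, untested sketch against the same test_data table:]

```sql
SELECT data.ppid,
       data.point_time,
       COALESCE(data.the_geom,
                -- latest earlier non-NULL fix for the same person;
                -- LIMIT 1 replaces the NOT EXISTS "cull all but the
                -- most recent" subquery
                (SELECT geom.the_geom
                 FROM test_data geom
                 WHERE geom.ppid = data.ppid
                   AND geom.point_time < data.point_time
                   AND geom.the_geom IS NOT NULL
                 ORDER BY geom.point_time DESC
                 LIMIT 1)) AS the_geom
FROM test_data data;
```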

On 30 January 2014 15:24, Erik Darling <edarling80@gmail.com> wrote:
> Hi James,
>
> This is pretty close, but I have to get back to work for the time being.
> Feel free to mess with it. If you don't come up with a fully working
> solution, I can look at it later tonight (US EST). From your test data, only
> one row is missing the correct geom.
>
> WITH C (id, pt, gm, dr) as (
> SELECT ppid, point_time, the_geom,
>         dense_rank() over (partition by ppid, the_geom order by ppid,
> point_time) as dr
> FROM test_data
> )
> SELECT id, pt, gm,
>       CASE WHEN gm IS NULL THEN
>       LAG(gm, cast(c.DR as int) ) OVER (PARTITION BY id ORDER BY id, pt)
>       ELSE gm END as gm2
> FROM C
> ORDER BY id, pt, gm
>
>
>
> On Thu, Jan 30, 2014 at 8:58 AM, James David Smith
> <james.david.smith@gmail.com> wrote:
>>
>> All,
>>
>> Here's a SQL fiddle of my problem:
>>
>> http://sqlfiddle.com/#!15/77157
>>
>> Thanks
>>
>> James
>>
>> On 30 January 2014 11:19, James David Smith <james.david.smith@gmail.com>
>> wrote:
>> > Hi Erik / all,
>> >
>> > I don't think that will work, as what happens if one of the people has
>> > two missing periods of time within their day?
>> >
>> > I've made a self-contained example of my problem below. Would you mind
>> > trying to produce in the code below what you think I should do?  Or if
>> > anyone else fancies having a go then please do.
>> >
>> > I very much appreciate your help by the way. Thank you. I'm really at
>> > a loss with this. :-(
>> >
>> > James
>> > --------------------------------------
>> >
>> > DROP TABLE test_data;
>> >
>> > CREATE TABLE test_data(
>> > ppid integer,
>> > point_time timestamp without time zone,
>> > the_geom integer);
>> >
>> > INSERT INTO test_data VALUES
>> > ('1', '2012-01-01 07:00', '1'),
>> > ('1', '2012-01-01 07:01', '1'),
>> > ('1', '2012-01-01 07:02', '1'),
>> > ('1', '2012-01-01 07:03', NULL), -- null should be replaced with 1
>> > ('1', '2012-01-01 07:04', NULL), -- null should be replaced with 1
>> > ('1', '2012-01-01 07:05', '5'),
>> > ('1', '2012-01-01 07:06', '5'),
>> > ('1', '2012-01-01 07:07', '5'),
>> > ('1', '2012-01-01 07:08', NULL), -- null should be replaced with 5
>> > ('1', '2012-01-01 07:09', NULL), -- null should be replaced with 5
>> > ('1', '2012-01-01 07:10', NULL), -- null should be replaced with 5
>> > ('1', '2012-01-01 07:11', NULL), -- null should be replaced with 5
>> > ('2', '2013-05-02 07:12', '24'),
>> > ('2', '2013-05-02 07:13', '24'),
>> > ('2', '2013-05-02 07:14', '24'),
>> > ('2', '2013-05-02 07:15', NULL), -- null should be replaced with 24
>> > ('2', '2013-05-02 07:16', NULL), -- null should be replaced with 24
>> > ('2', '2013-05-02 07:17', '44'),
>> > ('2', '2013-05-02 07:18', '44'),
>> > ('2', '2013-05-02 07:19', NULL), -- null should be replaced with 44
>> > ('2', '2013-05-02 07:20', '4'),
>> > ('2', '2013-05-02 07:21', '4'),
>> > ('2', '2013-05-02 07:22', '4');
>> >
>> > WITH         missing_geoms AS (
>> > SELECT        ppid,
>> >         point_time,
>> >         the_geom
>> > FROM        test_data
>> > WHERE        the_geom IS NULL)
>> > ---
>> >         ,filled_geoms AS (
>> > SELECT        ppid,
>> >         point_time,
>> >         the_geom
>> > FROM        test_data
>> > WHERE        the_geom IS NOT NULL)
>> > ---
>> >         ,partitioned AS (
>> > SELECT        missing_geoms.ppid as missing_geoms_ppid,
>> >         missing_geoms.point_time as missing_geoms_point_time,
>> >         missing_geoms.the_geom as missing_geoms_the_geom,
>> >         filled_geoms.ppid as filled_geoms_ppid,
>> >         filled_geoms.point_time as filled_geoms_point_time,
>> >         filled_geoms.the_geom as filled_geoms_the_geom
>> > FROM        missing_geoms
>> > LEFT JOIN    filled_geoms
>> > ON        filled_geoms.point_time < missing_geoms.point_time
>> > AND        filled_geoms.ppid = missing_geoms.ppid
>> > ORDER BY    missing_geoms_ppid,
>> >         missing_geoms_point_time)
>> > ---
>> > SELECT        *
>> > FROM        partitioned;
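[Editorial note: the partitioned CTE above materialises every (missing, filled) pair before discarding all but one. PostgreSQL's DISTINCT ON keeps exactly one joined row per missing fix, the latest earlier one, directly; an untested sketch against the same test_data table:]

```sql
SELECT DISTINCT ON (m.ppid, m.point_time)
       m.ppid,
       m.point_time,
       -- DISTINCT ON keeps the first row per (ppid, point_time) in the
       -- ORDER BY below, i.e. the most recent earlier non-NULL fix
       f.the_geom AS last_known_geom
FROM test_data m
LEFT JOIN test_data f
  ON f.ppid = m.ppid
 AND f.point_time < m.point_time
 AND f.the_geom IS NOT NULL
WHERE m.the_geom IS NULL
ORDER BY m.ppid, m.point_time, f.point_time DESC;
```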
>> >
>> >
>> >
>> >
>> >
>> > On 29 January 2014 22:10, Erik Darling <edarling80@gmail.com> wrote:
>> >> Hi James,
>> >>
>> >> Yeah, that looks like the right place to me.
>> >>
>> >>
>> >> On Wed, Jan 29, 2014 at 1:14 PM, James David Smith
>> >> <james.david.smith@gmail.com> wrote:
>> >>>
>> >>> Hi Erik,
>> >>>
>> >>> Do you mean in this section of the SQL?
>> >>> .....
>> >>> filled_geoms AS (
>> >>> SELECT
>> >>> ppid,
>> >>> point_time,
>> >>> the_geom
>> >>> FROM
>> >>> hybrid_location
>> >>> WHERE
>> >>> the_geom IS NOT NULL)
>> >>> ...
>> >>>
>> >>> Thanks
>> >>>
>> >>> James
>> >>>
>> >>> On 29 January 2014 17:57, Erik Darling <edarling80@gmail.com> wrote:
>> >>> > Hi James,
>> >>> >
>> >>> > I think you're still stuck with sort of unnecessary ('too much' )
>> >>> > data
>> >>> > coming from the right side of your left join. If so, one option I
>> >>> > would
>> >>> > consider is using DENSE_RANK() the way you use ROW_NUMBER(), in the
>> >>> > filled_geoms table. If you partition by id and order by date
>> >>> > descending,
>> >>> > you
>> >>> > can do an additional d_rank = 1 filter to only get the most recent
>> >>> > activity.
>> >>> > I believe this is what you want to set your NULL values to, no?
>> >>> >
>> >>> >
>> >>> >
>> >>> >
>> >>> > On Wed, Jan 29, 2014 at 12:41 PM, James David Smith
>> >>> > <james.david.smith@gmail.com> wrote:
>> >>> >>
>> >>> >> On 29 January 2014 16:02, Erik Darling <edarling80@gmail.com>
>> >>> >> wrote:
>> >>> >> > I would re-suggest using a CTE to contain each dataset to ensure
>> >>> >> > your
>> >>> >> > selects are distilling them correctly, and then using a final
>> >>> >> > query
>> >>> >> > to
>> >>> >> > join
>> >>> >> > them. You can then either update your data directly through the
>> >>> >> > CTE(s),
>> >>> >> > or
>> >>> >> > insert the results to another table to do some further testing. I
>> >>> >> > think
>> >>> >> > you'll find this method presents the data a bit more
>> >>> >> > ergonomically
>> >>> >> > for
>> >>> >> > analysis.
>> >>> >> >
>> >>> >> > http://www.postgresql.org/docs/9.3/static/queries-with.html
>> >>> >> >
>> >>> >> >
>> >>> >> >
>> >>> >> > On Wed, Jan 29, 2014 at 10:45 AM, James David Smith
>> >>> >> > <james.david.smith@gmail.com> wrote:
>> >>> >> >>
>> >>> >> >> Hi Erik/all,
>> >>> >> >>
>> >>> >> >> I just tried that, but it's tricky. The 'extra' data is indeed
>> >>> >> >> coming
>> >>> >> >> from the right side of the join, but it's hard to select only
>> >>> >> >> the
>> >>> >> >> max
>> >>> >> >> from it. Maybe it's possible but I've not managed to do it. Here
>> >>> >> >> is
>> >>> >> >> where I am, which is so very close.
>> >>> >> >>
>> >>> >> >> SELECT
>> >>> >> >> DISTINCT(a.ppid, a.point_time, a.the_geom) as
>> >>> >> >> row_that_needs_geom_updating,
>> >>> >> >> max(b.point_time) OVER (PARTITION BY a.ppid, a.point_time) as
>> >>> >> >> last_known_position_time
>> >>> >> >> FROM
>> >>> >> >> test a
>> >>> >> >> INNER JOIN
>> >>> >> >> (SELECT    ppid,
>> >>> >> >> point_time,
>> >>> >> >> the_geom
>> >>> >> >> FROM    test
>> >>> >> >> WHERE    the_geom IS NOT NULL) b
>> >>> >> >> ON b.point_time < a.point_time
>> >>> >> >> AND a.ppid = b.ppid
>> >>> >> >> WHERE a.the_geom IS NULL;
>> >>> >> >>
>> >>> >> >> If you see attached screen-print, the output is the rows that I
>> >>> >> >> want.
>> >>> >> >> However I've had to use DISTINCT to stop the duplication. Also
>> >>> >> >> I've
>> >>> >> >> not managed to pull through 'the_geom' from the JOIN. I'm not
>> >>> >> >> sure
>> >>> >> >> how. Anyone?
>> >>> >> >>
>> >>> >> >> But it's kind of working. :-)
>> >>> >> >>
>> >>> >> >> Worst case if I can't figure out how to solve this in one query
>> >>> >> >> I'll
>> >>> >> >> have to store the result of the above, and then use it as a
>> >>> >> >> basis
>> >>> >> >> for
>> >>> >> >> another query I think.
>> >>> >> >>
>> >>> >> >> Thanks
>> >>> >> >>
>> >>> >> >> James
>> >>> >> >>
>> >>> >> >>
>> >>> >> >>
>> >>> >> >> On 29 January 2014 12:56, Erik Darling <edarling80@gmail.com>
>> >>> >> >> wrote:
>> >>> >> >> > I would try partitioning the second time you call row_number,
>> >>> >> >> > perhaps
>> >>> >> >> > by
>> >>> >> >> > ID,
>> >>> >> >> > and then selecting the MAX() from that, since I think the too
>> >>> >> >> > much
>> >>> >> >> > data
>> >>> >> >> > you're referring to is coming from the right side of your
>> >>> >> >> > join.
>> >>> >> >> >
>> >>> >> >> > On Jan 29, 2014 7:23 AM, "James David Smith"
>> >>> >> >> > <james.david.smith@gmail.com>
>> >>> >> >> > wrote:
>> >>> >> >> >>
>> >>> >> >> >> On 28 January 2014 23:15, Gavin Flower
>> >>> >> >> >> <GavinFlower@archidevsys.co.nz>
>> >>> >> >> >> wrote:
>> >>> >> >> >> > On 29/01/14 11:00, Kevin Grittner wrote:
>> >>> >> >> >> >>
>> >>> >> >> >> >> James David Smith <james.david.smith@gmail.com> wrote:
>> >>> >> >> >> >>
>> >>> >> >> >> >>> Given the data is so large I don't want to be taking the
>> >>> >> >> >> >>> data
>> >>> >> >> >> >>> out
>> >>> >> >> >> >>> to a CSV or whatever and then loading it back in. I'd
>> >>> >> >> >> >>> like to
>> >>> >> >> >> >>> do
>> >>> >> >> >> >>> this within the database using SQL. I thought I would be
>> >>> >> >> >> >>> able
>> >>> >> >> >> >>> to
>> >>> >> >> >> >>> do this using a LOOP to be honest.
>> >>> >> >> >> >>
>> >>> >> >> >> >> I would be amazed if you couldn't do this with a single
>> >>> >> >> >> >> UPDATE
>> >>> >> >> >> >> statement.  I've generally found declarative forms of such
>> >>> >> >> >> >> work
>> >>> >> >> >> >> to
>> >>> >> >> >> >> be at least one order of magnitude faster than going to
>> >>> >> >> >> >> either
>> >>> >> >> >> >> a
>> >>> >> >> >> >> PL
>> >>> >> >> >> >> or a script approach.  I would start by putting together a
>> >>> >> >> >> >> SELECT
>> >>> >> >> >> >> query using window functions and maybe a CTE or two to
>> >>> >> >> >> >> list
>> >>> >> >> >> >> all
>> >>> >> >> >> >> the
>> >>> >> >> >> >> primary keys which need updating and the new values they
>> >>> >> >> >> >> should
>> >>> >> >> >> >> have.  Once that SELECT was looking good, I would put it
>> >>> >> >> >> >> in
>> >>> >> >> >> >> the
>> >>> >> >> >> >> FROM clause of an UPDATE statement.
>> >>> >> >> >> >>
>> >>> >> >> >> >> That should work, but if you are updating a large
>> >>> >> >> >> >> percentage
>> >>> >> >> >> >> of
>> >>> >> >> >> >> the
>> >>> >> >> >> >> table, I would go one step further before running this
>> >>> >> >> >> >> against
>> >>> >> >> >> >> the
>> >>> >> >> >> >> production tables.  I would put a LIMIT on the
>> >>> >> >> >> >> above-mentioned
>> >>> >> >> >> >> SELECT of something like 10000 rows, and script a loop
>> >>> >> >> >> >> that
>> >>> >> >> >> >> alternates between the UPDATE and a VACUUM ANALYZE on the
>> >>> >> >> >> >> table.
>> >>> >> >> >> >>
>> >>> >> >> >> >> --
>> >>> >> >> >> >> Kevin Grittner
>> >>> >> >> >> >> EDB: http://www.enterprisedb.com
>> >>> >> >> >> >> The Enterprise PostgreSQL Company
>> >>> >> >> >> >>
>> >>> >> >> >> >>
>> >>> >> >> >> > James, you might consider dropping as many indexes on the
>> >>> >> >> >> > table
>> >>> >> >> >> > as
>> >>> >> >> >> > you
>> >>> >> >> >> > safely can, and rebuilding them after the mass update.  If
>> >>> >> >> >> > you
>> >>> >> >> >> > have
>> >>> >> >> >> > lots
>> >>> >> >> >> > of
>> >>> >> >> >> >> > such indexes, you will find this approach to be a lot
>> >>> >> >> >> > faster.
>> >>> >> >> >> >
>> >>> >> >> >> >
>> >>> >> >> >> > Cheers,
>> >>> >> >> >> > Gavin
>> >>> >> >> >>
>> >>> >> >> >> Hi all,
>> >>> >> >> >>
>> >>> >> >> >> Thanks for your help and assistance. I think that window
>> >>> >> >> >> functions,
>> >>> >> >> >> and in particular the PARTITION BY clause, is 100% the way to
>> >>> >> >> >> go.
>> >>> >> >> >> I've
>> >>> >> >> >> been concentrating on a SELECT statement for now and am close
>> >>> >> >> >> but
>> >>> >> >> >> not
>> >>> >> >> >> quite close enough. The below query gets all the data I want,
>> >>> >> >> >> but
>> >>> >> >> >> *too* much. What I've essentially done is:
>> >>> >> >> >>
>> >>> >> >> >> - Select all the rows that don't have any geom information
>> >>> >> >> >> - Join them with all rows before this point that *do* have
>> >>> >> >> >> geom
>> >>> >> >> >> information.
>> >>> >> >> >> - Before doing this join, use partition to generate row
>> >>> >> >> >> numbers.
>> >>> >> >> >>
>> >>> >> >> >> The attached screen grab shows the result of my query below.
>> >>> >> >> >> Unfortunately this is generating a lot of joins that I don't
>> >>> >> >> >> want.
>> >>> >> >> >> This
>> >>> >> >> >> won't be practical when doing it with 75,000 people.
>> >>> >> >> >>
>> >>> >> >> >> Thoughts and code suggestions very much appreciated... if
>> >>> >> >> >> needed
>> >>> >> >> >> I
>> >>> >> >> >> could put together some SQL to create an example table?
>> >>> >> >> >>
>> >>> >> >> >> Thanks
>> >>> >> >> >>
>> >>> >> >> >> SELECT     row_number() OVER (PARTITION BY test.point_time
>> >>> >> >> >> ORDER
>> >>> >> >> >> BY
>> >>> >> >> >> test.point_time) as test_row,
>> >>> >> >> >>     test.ppid as test_ppid,
>> >>> >> >> >>     test.point_time as test_point_time,
>> >>> >> >> >>     test.the_geom as test_the_geom,
>> >>> >> >> >>     a.ppid as a_ppid,
>> >>> >> >> >>     a.point_time as a_point_time,
>> >>> >> >> >>     a.the_geom as a_the_geom,
>> >>> >> >> >>     a.a_row
>> >>> >> >> >> FROM     test
>> >>> >> >> >>     LEFT JOIN (
>> >>> >> >> >>         SELECT    the_geom,
>> >>> >> >> >>             ppid,
>> >>> >> >> >>             point_time,
>> >>> >> >> >>             row_number() OVER (ORDER BY ppid, point_time) as
>> >>> >> >> >> a_row
>> >>> >> >> >>         FROM     test
>> >>> >> >> >>         WHERE     the_geom IS NOT NULL) a
>> >>> >> >> >> ON    a.point_time < test.point_time
>> >>> >> >> >> AND    a.ppid = test.ppid
>> >>> >> >> >> WHERE    test.the_geom IS NULL
>> >>> >> >> >> ORDER BY test.point_time;
>> >>> >> >> >>
>> >>> >>
>> >>> >>
>> >>> >> Hi Erik / all,
>> >>> >>
>> >>> >> So I think I've managed to re-write my queries using CTEs. The
>> >>> >> below
>> >>> >> code now does get me the data that I want from this. But to do so
>> >>> >> it
>> >>> >> is going to create a frankly huge table in the bit of the SQL where
>> >>> >> it
>> >>> >> makes the table called 'partitioned'. My rough guess is that it'll
>> >>> >> have to make a table of about 100 billion rows in order to get the
>> >>> >> data I need (about 108 million rows).
>> >>> >>
>> >>> >> Could someone please glance through it for me and suggest how to
>> >>> >> write
>> >>> >> it more efficiently?
>> >>> >>
>> >>> >> Thanks
>> >>> >>
>> >>> >> James
>> >>> >>
>> >>> >> WITH         missing_geoms AS (
>> >>> >> SELECT        ppid,
>> >>> >>         point_time,
>> >>> >>         the_geom
>> >>> >> FROM        hybrid_location
>> >>> >> WHERE        the_geom IS NULL)
>> >>> >> -----------------
>> >>> >>         ,filled_geoms AS (
>> >>> >> SELECT        ppid,
>> >>> >>         point_time,
>> >>> >>         the_geom
>> >>> >> FROM        hybrid_location
>> >>> >> WHERE        the_geom IS NOT NULL)
>> >>> >> ----------------
>> >>> >>         ,partitioned AS (
>> >>> >> SELECT        missing_geoms.ppid,
>> >>> >>         missing_geoms.point_time,
>> >>> >>         missing_geoms.the_geom,
>> >>> >>         filled_geoms.ppid,
>> >>> >>         filled_geoms.point_time,
>> >>> >>         filled_geoms.the_geom,
>> >>> >>         row_number() OVER (    PARTITION BY     missing_geoms.ppid,
>> >>> >>                             missing_geoms.point_time
>> >>> >>                     ORDER BY    missing_geoms.ppid,
>> >>> >>                             missing_geoms.point_time,
>> >>> >>                             filled_geoms.ppid,
>> >>> >>                             filled_geoms.point_time DESC)
>> >>> >> FROM        missing_geoms
>> >>> >> LEFT JOIN    filled_geoms
>> >>> >> ON        filled_geoms.point_time < missing_geoms.point_time
>> >>> >> AND        filled_geoms.ppid = missing_geoms.ppid)
>> >>> >> --------------
>> >>> >> SELECT         *
>> >>> >> FROM        partitioned
>> >>> >> WHERE        row_number = 1;
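[Editorial note: for the eventual write-back, Kevin's suggestion upthread of putting the working SELECT in the FROM clause of an UPDATE would look roughly like this. Untested sketch; it assumes (ppid, point_time) uniquely identifies a row in hybrid_location:]

```sql
WITH partitioned AS (
    SELECT m.ppid, m.point_time,
           f.the_geom AS fill_geom,
           -- rank earlier fixes per missing row, newest first
           row_number() OVER (PARTITION BY m.ppid, m.point_time
                              ORDER BY f.point_time DESC) AS rn
    FROM hybrid_location m
    JOIN hybrid_location f
      ON f.ppid = m.ppid
     AND f.point_time < m.point_time
     AND f.the_geom IS NOT NULL
    WHERE m.the_geom IS NULL
)
UPDATE hybrid_location h
SET the_geom = p.fill_geom
FROM partitioned p
WHERE p.rn = 1            -- keep only the most recent earlier fix
  AND h.ppid = p.ppid
  AND h.point_time = p.point_time;
```

[Per Kevin's note, on a large table it may be worth adding a LIMIT to the inner SELECT and looping UPDATE / VACUUM ANALYZE in batches.]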
>> >>> >>
>> >>> >> James
>> >>> >
>> >>> >
>> >>
>> >>
>
>
