Обсуждение: Fwd: Tricky join and update with same table

Поиск
Список
Период
Сортировка

Fwd: Tricky join and update with same table

От
James David Smith
Дата:
Hi all,

Some help if you can please. I have GPS data for a number of people. Each person turned the GPS on, at home, at a different time of the day. Before that time, I presume that the person was in their house. I have made a table for each person for a whole 24 hours (one record per minute), and I now want to 'fill in' their location for the rows before they turned the GPS on. So for each person I want to take the first row where the point_geom is not null, and update all of the rows above it with that value. It's driving me nuts.

spid     |     point_time                 |     point_geom
-----------------------------------------------------------
1         |     2012-01-01 00:01:00   |    
1         |     2012-01-01 00:02:00   |     
1         |     2012-01-01 00:03:00   |     POINT(X, Y)
1         |     2012-01-01 00:04:00   |     POINT(X, Y)
1         |     2012-01-01 00:05:00   |     POINT(X, Y)
2         |     2012-01-01 00:01:00   |     
2         |     2012-01-01 00:02:00   |     
2         |     2012-01-01 00:03:00   |     
2         |     2012-01-01 00:04:00   |     
2         |     2012-01-01 00:05:00   |     POINT(X, Y)
3         |     2012-01-01 00:01:00   |     
3         |     2012-01-01 00:02:00   |     POINT(X, Y)
3         |     2012-01-01 00:03:00   |     POINT(X, Y)     
3         |     2012-01-01 00:04:00   |     POINT(X, Y)
3         |     2012-01-01 00:05:00   |     POINT(X, Y)

I've managed to select the correct row using this:

SELECT spid, min(point_time) as point_time
FROM hlhs_day
WHERE point_geom IS NOT NULL
GROUP BY spid;

However when I try to add in the column 'point_geom' to the query, it won't work.

Thanks for your help and suggestions.

James

Re: Fwd: Tricky join and update with same table

От
Michael Swierczek
Дата:
On Fri, Jul 26, 2013 at 5:24 AM, James David Smith
<james.david.smith@gmail.com> wrote:
> Hi all,
>
> Some help if you can please. I have GPS data for a number of people. Each
> person turned the GPS on, at home, at a different time of the day. Before
> that time, I presume that the person was in their house. I have made a table
> for each person for a whole 24 hours (one record per minute), and I now want
> to 'fill in' their location for the rows before they turned the GPS on. So
> for each person I want to take the first row where the point_geom is not
> null, and update all of the rows above it with that value. It's driving me
> nuts.
>
> spid     |     point_time                 |     point_geom
> -----------------------------------------------------------
> 1         |     2012-01-01 00:01:00   |
> 1         |     2012-01-01 00:02:00   |
> 1         |     2012-01-01 00:03:00   |     POINT(X, Y)
> 1         |     2012-01-01 00:04:00   |     POINT(X, Y)
> 1         |     2012-01-01 00:05:00   |     POINT(X, Y)
> 2         |     2012-01-01 00:01:00   |
> 2         |     2012-01-01 00:02:00   |
> 2         |     2012-01-01 00:03:00   |
> 2         |     2012-01-01 00:04:00   |
> 2         |     2012-01-01 00:05:00   |     POINT(X, Y)
> 3         |     2012-01-01 00:01:00   |
> 3         |     2012-01-01 00:02:00   |     POINT(X, Y)
> 3         |     2012-01-01 00:03:00   |     POINT(X, Y)
> 3         |     2012-01-01 00:04:00   |     POINT(X, Y)
> 3         |     2012-01-01 00:05:00   |     POINT(X, Y)
>
> I've managed to select the correct row using this:
>
> SELECT spid, min(point_time) as point_time
> FROM hlhs_day
> WHERE point_geom IS NOT NULL
> GROUP BY spid;
>
> However when I try to add in the column 'point_geom' to the query, it won't
> work.
>
> Thanks for your help and suggestions.
>
> James

Usually when I chime in on questions like this, someone comes along
with a better solution after I finish.  But I think you can get the
information you want by "wrapping" that query:

SELECT hd1.spid, hd1.point_time, hd2.point_geom
FROM
(SELECT spid, min(point_time) as point_time
FROM hlhs_day
WHERE point_geom IS NOT NULL GROUP BY spid) as hd1
INNER JOIN hlhs_day hd2 ON hd1.spid = hd2.spid AND hd1.point_time =
hd2.point_time
ORDER BY hd1.spid;

I hope this helps.
--Mike


Re: Fwd: Tricky join and update with same table

От
James David Smith
Дата:
Hi Mike,

About half an hour before you replied I actually managed to come up with that myself! :-)

Using it in an update query is troubling me though. I've done it a few times now and it keeps updating the wrong fields.

Luca sent me an email a short while ago with some code to try however, and I'm doing that at the moment. Though it's been running for about 30 minutes now and isn't done. Though there are 230,000 rows, so perhaps that isn't that surprising after all. Here it is for posterity:

PDATE hlhs_day hd
SET point_geom = ( SELECT hd2.point_geom
                      FROM hlhs_day hd2
                      WHERE hd.spid = hd2.spid
                      AND hd2.point_geom IS NOT NULL
                      AND hd2.point_time = ( SELECT min( hd3.point_time )
                                           FROM hlhs_day hd3
                                           WHERE hd.spid = hd3.spid
                                           AND hd3.point_geom IS NOT NULL
                                           AND hd3.point_time > hd.point_time
                                           )
                      )
WHERE hd.point_geom IS null

Thanks

James






On 26 July 2013 14:37, Michael Swierczek <mike.swierczek@gmail.com> wrote:
On Fri, Jul 26, 2013 at 5:24 AM, James David Smith
<james.david.smith@gmail.com> wrote:
> Hi all,
>
> Some help if you can please. I have GPS data for a number of people. Each
> person turned the GPS on, at home, at a different time of the day. Before
> that time, I presume that the person was in their house. I have made a table
> for each person for a whole 24 hours (one record per minute), and I now want
> to 'fill in' their location for the rows before they turned the GPS on. So
> for each person I want to take the first row where the point_geom is not
> null, and update all of the rows above it with that value. It's driving me
> nuts.
>
> spid     |     point_time                 |     point_geom
> -----------------------------------------------------------
> 1         |     2012-01-01 00:01:00   |
> 1         |     2012-01-01 00:02:00   |
> 1         |     2012-01-01 00:03:00   |     POINT(X, Y)
> 1         |     2012-01-01 00:04:00   |     POINT(X, Y)
> 1         |     2012-01-01 00:05:00   |     POINT(X, Y)
> 2         |     2012-01-01 00:01:00   |
> 2         |     2012-01-01 00:02:00   |
> 2         |     2012-01-01 00:03:00   |
> 2         |     2012-01-01 00:04:00   |
> 2         |     2012-01-01 00:05:00   |     POINT(X, Y)
> 3         |     2012-01-01 00:01:00   |
> 3         |     2012-01-01 00:02:00   |     POINT(X, Y)
> 3         |     2012-01-01 00:03:00   |     POINT(X, Y)
> 3         |     2012-01-01 00:04:00   |     POINT(X, Y)
> 3         |     2012-01-01 00:05:00   |     POINT(X, Y)
>
> I've managed to select the correct row using this:
>
> SELECT spid, min(point_time) as point_time
> FROM hlhs_day
> WHERE point_geom IS NOT NULL
> GROUP BY spid;
>
> However when I try to add in the column 'point_geom' to the query, it won't
> work.
>
> Thanks for your help and suggestions.
>
> James

Usually when I chime in on questions like this, someone comes along
with a better solution after I finish.  But I think you can get the
information you want by "wrapping" that query:

SELECT hd1.spid, hd1.point_time, hd2.point_geom
FROM
(SELECT spid, min(point_time) as point_time
FROM hlhs_day
WHERE point_geom IS NOT NULL GROUP BY spid) as hd1
INNER JOIN hlhs_day hd2 ON hd1.spid = hd2.spid AND hd1.point_time =
hd2.point_time
ORDER BY hd1.spid;

I hope this helps.
--Mike

Re: Fwd: Tricky join and update with same table

От
David Johnston
Дата:
James David Smith wrote
> So for each person I want to take the first row where the
> point_geom is not null, and update all of the rows above it with that
> value. It's driving me nuts.

*Not tested but the theory has worked for me in the past for similar
problems.

/*
 * Useful for situation where you can build an ordered array where the last
 * cell value represents the current row and in the case where that value is
NULL
 * you want to retrieve the last known non-null value; effectively
 * carrying that value forward to the current row.
 */
CREATE OR REPLACE FUNCTION array_last_nonnull(in_array anyarray)
RETURNS anyelement
AS $$

    SELECT unnest FROM (
    SELECT unnest, row_number() OVER () AS array_index FROM (
    SELECT unnest($1)
    ) explode ) filter
    WHERE unnest IS NOT NULL
    ORDER BY array_index DESC
    LIMIT 1;

$$
LANGUAGE sql
STRICT
IMMUTABLE
;

SELECT id, array_last_nonnull(
array_agg(value) OVER (PARTITION BY id ORDER BY idx DESC ROWS BETWEEN
UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
)
,
array_agg(value) OVER (PARTITION BY id ORDER BY idx DESC ROWS BETWEEN
UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
FROM (VALUES (1,1,NULL), (1,2,1), (1,3,2), (2,1,NULL), (2,2,1)) val (id,
idx, value)
;

The function and array_agg window function expression could probably be
written differently - i.e., in a "first non-null" manner - but I already
have this function installed and understood so I used it instead.  The idea
to build an array for each ID, using a window function, of all possible
values and then pick out of that array the relevant non-null value that you
need.

Since you are looking forward, not backward, you must use the "RANGE" clause
as part of the window function definition.

Whether this performs well is another matter but one optimization you can
make, if the data supports it, is to limit the size of the frame so that the
corresponding array doesn't get too large.

David J.





--
View this message in context:
http://postgresql.1045698.n5.nabble.com/Fwd-Tricky-join-and-update-with-same-table-tp5765253p5765284.html
Sent from the PostgreSQL - novice mailing list archive at Nabble.com.


Re: Fwd: Tricky join and update with same table

От
Michael Swierczek
Дата:
On Fri, Jul 26, 2013 at 10:36 AM, James David Smith
<james.david.smith@gmail.com> wrote:
> Hi Mike,
>
> About half an hour before you replied I actually managed to come up with
> that myself! :-)
>
> Using it in an update query is troubling me though. I've done it a few times
> now and it keeps updating the wrong fields.
>
> Luca sent me an email a short while ago with some code to try however, and
> I'm doing that at the moment. Though it's been running for about 30 minutes
> now and isn't done. Though there are 230,000 rows, so perhaps that isn't
> that surprising after all. Here it is for posterity:
>
> PDATE hlhs_day hd
> SET point_geom = ( SELECT hd2.point_geom
>                       FROM hlhs_day hd2
>                       WHERE hd.spid = hd2.spid
>                       AND hd2.point_geom IS NOT NULL
>                       AND hd2.point_time = ( SELECT min( hd3.point_time )
>                                            FROM hlhs_day hd3
>                                            WHERE hd.spid = hd3.spid
>                                            AND hd3.point_geom IS NOT NULL
>                                            AND hd3.point_time >
> hd.point_time
>                                            )
>                       )
> WHERE hd.point_geom IS null
>
> Thanks
>
> James
>
>


This is a bit of a kluge solution, but if it were me I would take the
query I suggested (and you figured out on your own), put the results
in a text file, and then use Excel or some regular expressions in
Vim/Emacs/Notepad++/whatever to translate each line from
spidx,point_timex,point_geomx
to
update hlhs_day set point_geom = point_geomx where spid = spidx and
point_time  = point_timex;
Then run it through psql or pgAdmin3.

I believe that would probably run much more quickly, and you can break
it up into batches.

Good luck either way.
-Mike


>
>
>
>
> On 26 July 2013 14:37, Michael Swierczek <mike.swierczek@gmail.com> wrote:
>>
>> On Fri, Jul 26, 2013 at 5:24 AM, James David Smith
>> <james.david.smith@gmail.com> wrote:
>> > Hi all,
>> >
>> > Some help if you can please. I have GPS data for a number of people.
>> > Each
>> > person turned the GPS on, at home, at a different time of the day.
>> > Before
>> > that time, I presume that the person was in their house. I have made a
>> > table
>> > for each person for a whole 24 hours (one record per minute), and I now
>> > want
>> > to 'fill in' their location for the rows before they turned the GPS on.
>> > So
>> > for each person I want to take the first row where the point_geom is not
>> > null, and update all of the rows above it with that value. It's driving
>> > me
>> > nuts.
>> >
>> > spid     |     point_time                 |     point_geom
>> > -----------------------------------------------------------
>> > 1         |     2012-01-01 00:01:00   |
>> > 1         |     2012-01-01 00:02:00   |
>> > 1         |     2012-01-01 00:03:00   |     POINT(X, Y)
>> > 1         |     2012-01-01 00:04:00   |     POINT(X, Y)
>> > 1         |     2012-01-01 00:05:00   |     POINT(X, Y)
>> > 2         |     2012-01-01 00:01:00   |
>> > 2         |     2012-01-01 00:02:00   |
>> > 2         |     2012-01-01 00:03:00   |
>> > 2         |     2012-01-01 00:04:00   |
>> > 2         |     2012-01-01 00:05:00   |     POINT(X, Y)
>> > 3         |     2012-01-01 00:01:00   |
>> > 3         |     2012-01-01 00:02:00   |     POINT(X, Y)
>> > 3         |     2012-01-01 00:03:00   |     POINT(X, Y)
>> > 3         |     2012-01-01 00:04:00   |     POINT(X, Y)
>> > 3         |     2012-01-01 00:05:00   |     POINT(X, Y)
>> >
>> > I've managed to select the correct row using this:
>> >
>> > SELECT spid, min(point_time) as point_time
>> > FROM hlhs_day
>> > WHERE point_geom IS NOT NULL
>> > GROUP BY spid;
>> >
>> > However when I try to add in the column 'point_geom' to the query, it
>> > won't
>> > work.
>> >
>> > Thanks for your help and suggestions.
>> >
>> > James
>>
>> Usually when I chime in on questions like this, someone comes along
>> with a better solution after I finish.  But I think you can get the
>> information you want by "wrapping" that query:
>>
>> SELECT hd1.spid, hd1.point_time, hd2.point_geom
>> FROM
>> (SELECT spid, min(point_time) as point_time
>> FROM hlhs_day
>> WHERE point_geom IS NOT NULL GROUP BY spid) as hd1
>> INNER JOIN hlhs_day hd2 ON hd1.spid = hd2.spid AND hd1.point_time =
>> hd2.point_time
>> ORDER BY hd1.spid;
>>
>> I hope this helps.
>> --Mike
>
>


Re: Fwd: Tricky join and update with same table

От
Luca Ferrari
Дата:
On Fri, Jul 26, 2013 at 4:36 PM, James David Smith
<james.david.smith@gmail.com> wrote:
> Luca sent me an email a short while ago with some code to try however, and
> I'm doing that at the moment. Though it's been running for about 30 minutes
> now and isn't done. Though there are 230,000 rows, so perhaps that isn't
> that surprising after all. Here it is for posterity:
>

Ops...I was supposed to hit the "reply all" button!
Supposing it is working for you, you can at least split the update
into chunks insering a condition on the main query to update only rows
within a certain date range (let's say the most recent ones). That
will tell you if the query is working properly.
That is als the reason why I suggested using a trigger for further
inserts: I was suppsoing you had a lot of data and therefore doing an
update of chunks when the data is inserted does not make you have to
run a very long query.

Luca


Re: Fwd: Tricky join and update with same table

От
James David Smith
Дата:
On 26 July 2013 17:05, Luca Ferrari <fluca1978@infinito.it> wrote:
On Fri, Jul 26, 2013 at 4:36 PM, James David Smith
<james.david.smith@gmail.com> wrote:
> Luca sent me an email a short while ago with some code to try however, and
> I'm doing that at the moment. Though it's been running for about 30 minutes
> now and isn't done. Though there are 230,000 rows, so perhaps that isn't
> that surprising after all. Here it is for posterity:
>

Ops...I was supposed to hit the "reply all" button!
Supposing it is working for you, you can at least split the update
into chunks insering a condition on the main query to update only rows
within a certain date range (let's say the most recent ones). That
will tell you if the query is working properly.
That is als the reason why I suggested using a trigger for further
inserts: I was suppsoing you had a lot of data and therefore doing an
update of chunks when the data is inserted does not make you have to
run a very long query.

Luca

Hi Luca/Michael,

I'm afraid that your query didn't work. It updated too many rows. I've managed to get the work done that I needed to do using the below queries i.e. using a temporary table. However if anyone can figure out how to roll these into one big query that'd be very useful for me long term please.

SELECT 
point_geom,
point_time,
a.spid
INTO
first_moment
FROM
hlhs_day a
JOIN (
SELECT
hlhs_day.spid,
min(hlhs_day.point_time) as min_point_time
FROM
hlhs_day
JOIN
hlhs_day as other1
ON
hlhs_day.point_time = other1.point_time
WHERE
hlhs_day.point_geom IS NOT NULL
GROUP B
hlhs_day.spid
) c
ON
a.point_time = c.min_point_time
AND 
a.spid = c.spid
ORDER BY
spid,
point_time;


UPDATE
hlhs_day
SET
point_geom = first_moment.point_geom
FROM
first_moment
WHERE
(hlhs_day.point_time < first_moment.point_time
AND
hlhs_day.spid = first_moment.spid
AND
hlhs_day.point_geom IS NULL);

Thanks

James