Обсуждение: Fwd: Tricky join and update with same table
Hi all,
Some help if you can please. I have GPS data for a number of people. Each person turned the GPS on, at home, at a different time of the day. Before that time, I presume that the person was in their house. I have made a table for each person for a whole 24 hours (one record per minute), and I now want to 'fill in' their location for the rows before they turned the GPS on. So for each person I want to take the first row where the point_geom is not null, and update all of the rows above it with that value. It's driving me nuts.
spid | point_time | point_geom
-----------------------------------------------------------
1 | 2012-01-01 00:01:00 |
1 | 2012-01-01 00:02:00 |
1 | 2012-01-01 00:03:00 | POINT(X, Y)
1 | 2012-01-01 00:04:00 | POINT(X, Y)
1 | 2012-01-01 00:05:00 | POINT(X, Y)
2 | 2012-01-01 00:01:00 |
2 | 2012-01-01 00:02:00 |
2 | 2012-01-01 00:03:00 |
2 | 2012-01-01 00:04:00 |
2 | 2012-01-01 00:05:00 | POINT(X, Y)
3 | 2012-01-01 00:01:00 |
3 | 2012-01-01 00:02:00 | POINT(X, Y)
3 | 2012-01-01 00:03:00 | POINT(X, Y)
3 | 2012-01-01 00:04:00 | POINT(X, Y)
3 | 2012-01-01 00:05:00 | POINT(X, Y)
I've managed to select the correct row using this:
SELECT spid, min(point_time) as point_time
FROM hlhs_day
WHERE point_geom IS NOT NULL
GROUP BY spid;
However when I try to add in the column 'point_geom' to the query, it won't work.
Thanks for your help and suggestions.
James
On Fri, Jul 26, 2013 at 5:24 AM, James David Smith <james.david.smith@gmail.com> wrote: > Hi all, > > Some help if you can please. I have GPS data for a number of people. Each > person turned the GPS on, at home, at a different time of the day. Before > that time, I presume that the person was in their house. I have made a table > for each person for a whole 24 hours (one record per minute), and I now want > to 'fill in' their location for the rows before they turned the GPS on. So > for each person I want to take the first row where the point_geom is not > null, and update all of the rows above it with that value. It's driving me > nuts. > > spid | point_time | point_geom > ----------------------------------------------------------- > 1 | 2012-01-01 00:01:00 | > 1 | 2012-01-01 00:02:00 | > 1 | 2012-01-01 00:03:00 | POINT(X, Y) > 1 | 2012-01-01 00:04:00 | POINT(X, Y) > 1 | 2012-01-01 00:05:00 | POINT(X, Y) > 2 | 2012-01-01 00:01:00 | > 2 | 2012-01-01 00:02:00 | > 2 | 2012-01-01 00:03:00 | > 2 | 2012-01-01 00:04:00 | > 2 | 2012-01-01 00:05:00 | POINT(X, Y) > 3 | 2012-01-01 00:01:00 | > 3 | 2012-01-01 00:02:00 | POINT(X, Y) > 3 | 2012-01-01 00:03:00 | POINT(X, Y) > 3 | 2012-01-01 00:04:00 | POINT(X, Y) > 3 | 2012-01-01 00:05:00 | POINT(X, Y) > > I've managed to select the correct row using this: > > SELECT spid, min(point_time) as point_time > FROM hlhs_day > WHERE point_geom IS NOT NULL > GROUP BY spid; > > However when I try to add in the column 'point_geom' to the query, it won't > work. > > Thanks for your help and suggestions. > > James Usually when I chime in on questions like this, someone comes along with a better solution after I finish. But I think you can get the information you want by "wrapping" that query: SELECT hd1.spid, hd1.point_time, hd2.point_geom FROM (SELECT spid, min(point_time) as point_time FROM hlhs_day WHERE point_geom IS NOT NULL GROUP BY spid) as hd1 INNER JOIN hlhs_day hd2 ON hd1.spid = hd2.spid AND hd1.point_time = hd2.point_time ORDER BY hd1.spid; I hope this helps. --Mike
Hi Mike,
About half an hour before you replied I actually managed to come up with that myself! :-)
Using it in an update query is troubling me though. I've done it a few times now and it keeps updating the wrong fields.
Luca sent me an email a short while ago with some code to try however, and I'm doing that at the moment. Though it's been running for about 30 minutes now and isn't done. Though there are 230,000 rows, so perhaps that isn't that surprising after all. Here it is for posterity:
PDATE hlhs_day hd
SET point_geom = ( SELECT hd2.point_geom
FROM hlhs_day hd2
WHERE hd.spid = hd2.spid
AND hd2.point_geom IS NOT NULL
AND hd2.point_time = ( SELECT min( hd3.point_time )
FROM hlhs_day hd3
WHERE hd.spid = hd3.spid
AND hd3.point_geom IS NOT NULL
AND hd3.point_time > hd.point_time
)
)
WHERE hd.point_geom IS null
SET point_geom = ( SELECT hd2.point_geom
FROM hlhs_day hd2
WHERE hd.spid = hd2.spid
AND hd2.point_geom IS NOT NULL
AND hd2.point_time = ( SELECT min( hd3.point_time )
FROM hlhs_day hd3
WHERE hd.spid = hd3.spid
AND hd3.point_geom IS NOT NULL
AND hd3.point_time > hd.point_time
)
)
WHERE hd.point_geom IS null
Thanks
James
On 26 July 2013 14:37, Michael Swierczek <mike.swierczek@gmail.com> wrote:
Usually when I chime in on questions like this, someone comes alongOn Fri, Jul 26, 2013 at 5:24 AM, James David Smith
<james.david.smith@gmail.com> wrote:
> Hi all,
>
> Some help if you can please. I have GPS data for a number of people. Each
> person turned the GPS on, at home, at a different time of the day. Before
> that time, I presume that the person was in their house. I have made a table
> for each person for a whole 24 hours (one record per minute), and I now want
> to 'fill in' their location for the rows before they turned the GPS on. So
> for each person I want to take the first row where the point_geom is not
> null, and update all of the rows above it with that value. It's driving me
> nuts.
>
> spid | point_time | point_geom
> -----------------------------------------------------------
> 1 | 2012-01-01 00:01:00 |
> 1 | 2012-01-01 00:02:00 |
> 1 | 2012-01-01 00:03:00 | POINT(X, Y)
> 1 | 2012-01-01 00:04:00 | POINT(X, Y)
> 1 | 2012-01-01 00:05:00 | POINT(X, Y)
> 2 | 2012-01-01 00:01:00 |
> 2 | 2012-01-01 00:02:00 |
> 2 | 2012-01-01 00:03:00 |
> 2 | 2012-01-01 00:04:00 |
> 2 | 2012-01-01 00:05:00 | POINT(X, Y)
> 3 | 2012-01-01 00:01:00 |
> 3 | 2012-01-01 00:02:00 | POINT(X, Y)
> 3 | 2012-01-01 00:03:00 | POINT(X, Y)
> 3 | 2012-01-01 00:04:00 | POINT(X, Y)
> 3 | 2012-01-01 00:05:00 | POINT(X, Y)
>
> I've managed to select the correct row using this:
>
> SELECT spid, min(point_time) as point_time
> FROM hlhs_day
> WHERE point_geom IS NOT NULL
> GROUP BY spid;
>
> However when I try to add in the column 'point_geom' to the query, it won't
> work.
>
> Thanks for your help and suggestions.
>
> James
with a better solution after I finish. But I think you can get the
information you want by "wrapping" that query:
SELECT hd1.spid, hd1.point_time, hd2.point_geom
FROM
(SELECT spid, min(point_time) as point_time
FROM hlhs_day
WHERE point_geom IS NOT NULL GROUP BY spid) as hd1
INNER JOIN hlhs_day hd2 ON hd1.spid = hd2.spid AND hd1.point_time =
hd2.point_time
ORDER BY hd1.spid;
I hope this helps.
--Mike
James David Smith wrote > So for each person I want to take the first row where the > point_geom is not null, and update all of the rows above it with that > value. It's driving me nuts. *Not tested but the theory has worked for me in the past for similar problems. /* * Useful for situation where you can build an ordered array where the last * cell value represents the current row and in the case where that value is NULL * you want to retrieve the last known non-null value; effectively * carrying that value forward to the current row. */ CREATE OR REPLACE FUNCTION array_last_nonnull(in_array anyarray) RETURNS anyelement AS $$ SELECT unnest FROM ( SELECT unnest, row_number() OVER () AS array_index FROM ( SELECT unnest($1) ) explode ) filter WHERE unnest IS NOT NULL ORDER BY array_index DESC LIMIT 1; $$ LANGUAGE sql STRICT IMMUTABLE ; SELECT id, array_last_nonnull( array_agg(value) OVER (PARTITION BY id ORDER BY idx DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) ) , array_agg(value) OVER (PARTITION BY id ORDER BY idx DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) FROM (VALUES (1,1,NULL), (1,2,1), (1,3,2), (2,1,NULL), (2,2,1)) val (id, idx, value) ; The function and array_agg window function expression could probably be written differently - i.e., in a "first non-null" manner - but I already have this function installed and understood so I used it instead. The idea to build an array for each ID, using a window function, of all possible values and then pick out of that array the relevant non-null value that you need. Since you are looking forward, not backward, you must use the "RANGE" clause as part of the window function definition. Whether this performs well is another matter but one optimization you can make, if the data supports it, is to limit the size of the frame so that the corresponding array doesn't get too large. David J. -- View this message in context: http://postgresql.1045698.n5.nabble.com/Fwd-Tricky-join-and-update-with-same-table-tp5765253p5765284.html Sent from the PostgreSQL - novice mailing list archive at Nabble.com.
On Fri, Jul 26, 2013 at 10:36 AM, James David Smith <james.david.smith@gmail.com> wrote: > Hi Mike, > > About half an hour before you replied I actually managed to come up with > that myself! :-) > > Using it in an update query is troubling me though. I've done it a few times > now and it keeps updating the wrong fields. > > Luca sent me an email a short while ago with some code to try however, and > I'm doing that at the moment. Though it's been running for about 30 minutes > now and isn't done. Though there are 230,000 rows, so perhaps that isn't > that surprising after all. Here it is for posterity: > > PDATE hlhs_day hd > SET point_geom = ( SELECT hd2.point_geom > FROM hlhs_day hd2 > WHERE hd.spid = hd2.spid > AND hd2.point_geom IS NOT NULL > AND hd2.point_time = ( SELECT min( hd3.point_time ) > FROM hlhs_day hd3 > WHERE hd.spid = hd3.spid > AND hd3.point_geom IS NOT NULL > AND hd3.point_time > > hd.point_time > ) > ) > WHERE hd.point_geom IS null > > Thanks > > James > > This is a bit of a kluge solution, but if it were me I would take the query I suggested (and you figured out on your own), put the results in a text file, and then use Excel or some regular expressions in Vim/Emacs/Notepad++/whatever to translate each line from spidx,point_timex,point_geomx to update hlhs_day set point_geom = point_geomx where spid = spidx and point_time = point_timex; Then run it through psql or pgAdmin3. I believe that would probably run much more quickly, and you can break it up into batches. Good luck either way. -Mike > > > > > On 26 July 2013 14:37, Michael Swierczek <mike.swierczek@gmail.com> wrote: >> >> On Fri, Jul 26, 2013 at 5:24 AM, James David Smith >> <james.david.smith@gmail.com> wrote: >> > Hi all, >> > >> > Some help if you can please. I have GPS data for a number of people. >> > Each >> > person turned the GPS on, at home, at a different time of the day. >> > Before >> > that time, I presume that the person was in their house. I have made a >> > table >> > for each person for a whole 24 hours (one record per minute), and I now >> > want >> > to 'fill in' their location for the rows before they turned the GPS on. >> > So >> > for each person I want to take the first row where the point_geom is not >> > null, and update all of the rows above it with that value. It's driving >> > me >> > nuts. >> > >> > spid | point_time | point_geom >> > ----------------------------------------------------------- >> > 1 | 2012-01-01 00:01:00 | >> > 1 | 2012-01-01 00:02:00 | >> > 1 | 2012-01-01 00:03:00 | POINT(X, Y) >> > 1 | 2012-01-01 00:04:00 | POINT(X, Y) >> > 1 | 2012-01-01 00:05:00 | POINT(X, Y) >> > 2 | 2012-01-01 00:01:00 | >> > 2 | 2012-01-01 00:02:00 | >> > 2 | 2012-01-01 00:03:00 | >> > 2 | 2012-01-01 00:04:00 | >> > 2 | 2012-01-01 00:05:00 | POINT(X, Y) >> > 3 | 2012-01-01 00:01:00 | >> > 3 | 2012-01-01 00:02:00 | POINT(X, Y) >> > 3 | 2012-01-01 00:03:00 | POINT(X, Y) >> > 3 | 2012-01-01 00:04:00 | POINT(X, Y) >> > 3 | 2012-01-01 00:05:00 | POINT(X, Y) >> > >> > I've managed to select the correct row using this: >> > >> > SELECT spid, min(point_time) as point_time >> > FROM hlhs_day >> > WHERE point_geom IS NOT NULL >> > GROUP BY spid; >> > >> > However when I try to add in the column 'point_geom' to the query, it >> > won't >> > work. >> > >> > Thanks for your help and suggestions. >> > >> > James >> >> Usually when I chime in on questions like this, someone comes along >> with a better solution after I finish. But I think you can get the >> information you want by "wrapping" that query: >> >> SELECT hd1.spid, hd1.point_time, hd2.point_geom >> FROM >> (SELECT spid, min(point_time) as point_time >> FROM hlhs_day >> WHERE point_geom IS NOT NULL GROUP BY spid) as hd1 >> INNER JOIN hlhs_day hd2 ON hd1.spid = hd2.spid AND hd1.point_time = >> hd2.point_time >> ORDER BY hd1.spid; >> >> I hope this helps. >> --Mike > >
On Fri, Jul 26, 2013 at 4:36 PM, James David Smith <james.david.smith@gmail.com> wrote: > Luca sent me an email a short while ago with some code to try however, and > I'm doing that at the moment. Though it's been running for about 30 minutes > now and isn't done. Though there are 230,000 rows, so perhaps that isn't > that surprising after all. Here it is for posterity: > Ops...I was supposed to hit the "reply all" button! Supposing it is working for you, you can at least split the update into chunks insering a condition on the main query to update only rows within a certain date range (let's say the most recent ones). That will tell you if the query is working properly. That is als the reason why I suggested using a trigger for further inserts: I was suppsoing you had a lot of data and therefore doing an update of chunks when the data is inserted does not make you have to run a very long query. Luca
On 26 July 2013 17:05, Luca Ferrari <fluca1978@infinito.it> wrote:
On Fri, Jul 26, 2013 at 4:36 PM, James David SmithOps...I was supposed to hit the "reply all" button!
<james.david.smith@gmail.com> wrote:
> Luca sent me an email a short while ago with some code to try however, and
> I'm doing that at the moment. Though it's been running for about 30 minutes
> now and isn't done. Though there are 230,000 rows, so perhaps that isn't
> that surprising after all. Here it is for posterity:
>
Supposing it is working for you, you can at least split the update
into chunks insering a condition on the main query to update only rows
within a certain date range (let's say the most recent ones). That
will tell you if the query is working properly.
That is als the reason why I suggested using a trigger for further
inserts: I was suppsoing you had a lot of data and therefore doing an
update of chunks when the data is inserted does not make you have to
run a very long query.
Luca
Hi Luca/Michael,
I'm afraid that your query didn't work. It updated too many rows. I've managed to get the work done that I needed to do using the below queries i.e. using a temporary table. However if anyone can figure out how to roll these into one big query that'd be very useful for me long term please.
SELECT
point_geom,
point_time,
a.spid
INTO
first_moment
FROM
hlhs_day a
JOIN (
SELECT
hlhs_day.spid,
min(hlhs_day.point_time) as min_point_time
FROM
hlhs_day
JOIN
hlhs_day as other1
ON
hlhs_day.point_time = other1.point_time
WHERE
hlhs_day.point_geom IS NOT NULL
GROUP B
hlhs_day.spid
) c
ON
a.point_time = c.min_point_time
AND
a.spid = c.spid
ORDER BY
spid,
point_time;
UPDATE
hlhs_day
SET
point_geom = first_moment.point_geom
FROM
first_moment
WHERE
(hlhs_day.point_time < first_moment.point_time
AND
hlhs_day.spid = first_moment.spid
AND
hlhs_day.point_geom IS NULL);
Thanks
James