Re: Update with last known location?
From | James David Smith |
---|---|
Subject | Re: Update with last known location? |
Date | |
Msg-id | CAMu32AC33u9eLFq-yB4Py4U2u57nRnhMTuqwaM7dwdUsPg9xsw@mail.gmail.com |
In response to | Re: Update with last known location? (Erik Darling <edarling80@gmail.com>) |
Responses | Re: Update with last known location? |
List | pgsql-novice |
You're a star, Erik. Thanks again. I'll see how this goes. I posted the question on Stack Exchange earlier too, by the way, and someone replied with the code below. So I'm going to try both. Cheers again.

SELECT data.ppid,
       data.point_time,
       CASE WHEN data.the_geom IS NULL THEN
         ( -- Get all locations with an earlier time stamp for that ppid
           SELECT geom.the_geom
           FROM test_data geom
           WHERE data.ppid = geom.ppid
             AND geom.point_time < data.point_time
             AND geom.the_geom IS NOT NULL
             AND NOT EXISTS
               ( -- Cull all but the most recent one
                 SELECT *
                 FROM test_data cull
                 WHERE cull.ppid = geom.ppid
                   AND geom.the_geom IS NOT NULL
                   AND cull.point_time < data.point_time
                   AND cull.point_time > geom.point_time
                   AND cull.the_geom IS NOT NULL
               )
         )
       ELSE data.the_geom
       END
FROM test_data data

On 30 January 2014 15:24, Erik Darling <edarling80@gmail.com> wrote:
> Hi James,
>
> This is pretty close, but I have to get back to work for the time being.
> Feel free to mess with it. If you don't come up with a fully working
> solution, I can look at it later tonight (US EST). From your test data,
> only one row is missing the correct geom.
>
> WITH C (id, pt, gm, dr) as (
>     SELECT ppid, point_time, the_geom,
>         dense_rank() over (partition by ppid, the_geom
>                            order by ppid, point_time) as dr
>     FROM test_data
> )
> SELECT id, pt, gm,
>     CASE WHEN gm IS NULL THEN
>         LAG(gm, cast(c.DR as int)) OVER (PARTITION BY id ORDER BY id, pt)
>     ELSE gm END as gm2
> FROM C
> ORDER BY id, pt, gm
>
> On Thu, Jan 30, 2014 at 8:58 AM, James David Smith
> <james.david.smith@gmail.com> wrote:
>>
>> All,
>>
>> Here's a SQL fiddle of my problem:
>>
>> http://sqlfiddle.com/#!15/77157
>>
>> Thanks
>>
>> James
>>
>> On 30 January 2014 11:19, James David Smith
>> <james.david.smith@gmail.com> wrote:
>> > Hi Erik / all,
>> >
>> > I don't think that will work, as what happens if one of the people has
>> > two missing periods of time within their day?
>> >
>> > I've made a self-contained example of my problem below. Would you mind
>> > trying to produce in the code below what you think I should do? Or if
>> > anyone else fancies having a go then please do.
>> >
>> > I very much appreciate your help by the way. Thank you. I'm really at
>> > a loss with this. :-(
>> >
>> > James
>> > --------------------------------------
>> >
>> > DROP TABLE test_data;
>> >
>> > CREATE TABLE test_data(
>> >     ppid integer,
>> >     point_time timestamp without time zone,
>> >     the_geom integer);
>> >
>> > INSERT INTO test_data VALUES
>> > ('1', '2012-01-01 07:00', '1'),
>> > ('1', '2012-01-01 07:01', '1'),
>> > ('1', '2012-01-01 07:02', '1'),
>> > ('1', '2012-01-01 07:03', NULL), -- null should be replaced with 1
>> > ('1', '2012-01-01 07:04', NULL), -- null should be replaced with 1
>> > ('1', '2012-01-01 07:05', '5'),
>> > ('1', '2012-01-01 07:06', '5'),
>> > ('1', '2012-01-01 07:07', '5'),
>> > ('1', '2012-01-01 07:08', NULL), -- null should be replaced with 5
>> > ('1', '2012-01-01 07:09', NULL), -- null should be replaced with 5
>> > ('1', '2012-01-01 07:10', NULL), -- null should be replaced with 5
>> > ('1', '2012-01-01 07:11', NULL), -- null should be replaced with 5
>> > ('2', '2013-05-02 07:12', '24'),
>> > ('2', '2013-05-02 07:13', '24'),
>> > ('2', '2013-05-02 07:14', '24'),
>> > ('2', '2013-05-02 07:15', NULL), -- null should be replaced with 24
>> > ('2', '2013-05-02 07:16', NULL), -- null should be replaced with 24
>> > ('2', '2013-05-02 07:17', '44'),
>> > ('2', '2013-05-02 07:18', '44'),
>> > ('2', '2013-05-02 07:19', NULL), -- null should be replaced with 44
>> > ('2', '2013-05-02 07:20', '4'),
>> > ('2', '2013-05-02 07:21', '4'),
>> > ('2', '2013-05-02 07:22', '4');
>> >
>> > WITH missing_geoms AS (
>> >     SELECT ppid,
>> >         point_time,
>> >         the_geom
>> >     FROM test_data
>> >     WHERE the_geom IS NULL)
>> > ---
>> > ,filled_geoms AS (
>> >     SELECT ppid,
>> >         point_time,
>> >         the_geom
>> >     FROM test_data
>> >     WHERE the_geom IS NOT NULL)
>> > ---
>> > ,partitioned AS (
>> >     SELECT missing_geoms.ppid as missing_geoms_ppid,
>> >         missing_geoms.point_time as missing_geoms_point_time,
>> >         missing_geoms.the_geom as missing_geoms_the_geom,
>> >         filled_geoms.ppid as filled_geoms_ppid,
>> >         filled_geoms.point_time as filled_geoms_point_time,
>> >         filled_geoms.the_geom as filled_geoms_the_geom
>> >     FROM missing_geoms
>> >     LEFT JOIN filled_geoms
>> >         ON filled_geoms.point_time < missing_geoms.point_time
>> >         AND filled_geoms.ppid = missing_geoms.ppid
>> >     ORDER BY missing_geoms_ppid,
>> >         missing_geoms_point_time)
>> > ---
>> > SELECT *
>> > FROM partitioned;
>> >
>> > On 29 January 2014 22:10, Erik Darling <edarling80@gmail.com> wrote:
>> >> Hi James,
>> >>
>> >> Yeah, that looks like the right place to me.
>> >>
>> >> On Wed, Jan 29, 2014 at 1:14 PM, James David Smith
>> >> <james.david.smith@gmail.com> wrote:
>> >>>
>> >>> Hi Erik,
>> >>>
>> >>> Do you mean in this section of the SQL?
>> >>> .....
>> >>> filled_geoms AS (
>> >>>     SELECT ppid,
>> >>>         point_time,
>> >>>         the_geom
>> >>>     FROM hybrid_location
>> >>>     WHERE the_geom IS NOT NULL)
>> >>> ...
>> >>>
>> >>> Thanks
>> >>>
>> >>> James
>> >>>
>> >>> On 29 January 2014 17:57, Erik Darling <edarling80@gmail.com> wrote:
>> >>> > Hi James,
>> >>> >
>> >>> > I think you're still stuck with sort of unnecessary ('too much')
>> >>> > data coming from the right side of your left join. If so, one
>> >>> > option I would consider is using DENSE_RANK() the way you use
>> >>> > ROW_NUMBER(), in the filled_geoms table. If you partition by id
>> >>> > and order by date descending, you can do an additional d_rank = 1
>> >>> > filter to only get the most recent activity. I believe this is
>> >>> > what you want to set your NULL values to, no?
>> >>> >
>> >>> > On Wed, Jan 29, 2014 at 12:41 PM, James David Smith
>> >>> > <james.david.smith@gmail.com> wrote:
>> >>> >>
>> >>> >> On 29 January 2014 16:02, Erik Darling <edarling80@gmail.com>
>> >>> >> wrote:
>> >>> >> > I would re-suggest using a CTE to contain each dataset to
>> >>> >> > ensure your selects are distilling them correctly, and then
>> >>> >> > using a final query to join them. You can then either update
>> >>> >> > your data directly through the CTE(s), or insert the results
>> >>> >> > to another table to do some further testing. I think you'll
>> >>> >> > find this method presents the data a bit more ergonomically
>> >>> >> > for analysis.
>> >>> >> >
>> >>> >> > http://www.postgresql.org/docs/9.3/static/queries-with.html
>> >>> >> >
>> >>> >> > On Wed, Jan 29, 2014 at 10:45 AM, James David Smith
>> >>> >> > <james.david.smith@gmail.com> wrote:
>> >>> >> >>
>> >>> >> >> Hi Erik/all,
>> >>> >> >>
>> >>> >> >> I just tried that, but it's tricky. The 'extra' data is
>> >>> >> >> indeed coming from the right side of the join, but it's hard
>> >>> >> >> to select only the max from it. Maybe it's possible but I've
>> >>> >> >> not managed to do it. Here is where I am, which is so very
>> >>> >> >> close.
>> >>> >> >>
>> >>> >> >> SELECT
>> >>> >> >>     DISTINCT(a.ppid, a.point_time, a.the_geom)
>> >>> >> >>         as row_that_needs_geom_updating,
>> >>> >> >>     max(b.point_time) OVER (PARTITION BY a.ppid, a.point_time)
>> >>> >> >>         as last_known_position_time
>> >>> >> >> FROM test a
>> >>> >> >> INNER JOIN
>> >>> >> >>     (SELECT ppid,
>> >>> >> >>         point_time,
>> >>> >> >>         the_geom
>> >>> >> >>     FROM test
>> >>> >> >>     WHERE the_geom IS NOT NULL) b
>> >>> >> >> ON b.point_time < a.point_time
>> >>> >> >> AND a.ppid = b.ppid
>> >>> >> >> WHERE a.the_geom IS NULL;
>> >>> >> >>
>> >>> >> >> If you see the attached screen-print, the output is the rows
>> >>> >> >> that I want. However I've had to use DISTINCT to stop the
>> >>> >> >> duplication. Also I've not managed to pull through 'the_geom'
>> >>> >> >> from the JOIN. I'm not sure how. Anyone?
>> >>> >> >>
>> >>> >> >> But it's kind of working. :-)
>> >>> >> >>
>> >>> >> >> Worst case, if I can't figure out how to solve this in one
>> >>> >> >> query I'll have to store the result of the above, and then
>> >>> >> >> use it as a basis for another query I think.
>> >>> >> >>
>> >>> >> >> Thanks
>> >>> >> >>
>> >>> >> >> James
>> >>> >> >>
>> >>> >> >> On 29 January 2014 12:56, Erik Darling <edarling80@gmail.com>
>> >>> >> >> wrote:
>> >>> >> >> > I would try partitioning the second time you call
>> >>> >> >> > row_number, perhaps by ID, and then selecting the MAX()
>> >>> >> >> > from that, since I think the too much data you're referring
>> >>> >> >> > to is coming from the right side of your join.
>> >>> >> >> >
>> >>> >> >> > On Jan 29, 2014 7:23 AM, "James David Smith"
>> >>> >> >> > <james.david.smith@gmail.com> wrote:
>> >>> >> >> >>
>> >>> >> >> >> On 28 January 2014 23:15, Gavin Flower
>> >>> >> >> >> <GavinFlower@archidevsys.co.nz> wrote:
>> >>> >> >> >> > On 29/01/14 11:00, Kevin Grittner wrote:
>> >>> >> >> >> >>
>> >>> >> >> >> >> James David Smith <james.david.smith@gmail.com> wrote:
>> >>> >> >> >> >>
>> >>> >> >> >> >>> Given the data is so large I don't want to be taking
>> >>> >> >> >> >>> the data out to a CSV or whatever and then loading it
>> >>> >> >> >> >>> back in. I'd like to do this within the database using
>> >>> >> >> >> >>> SQL. I thought I would be able to do this using a LOOP
>> >>> >> >> >> >>> to be honest.
>> >>> >> >> >> >>
>> >>> >> >> >> >> I would be amazed if you couldn't do this with a single
>> >>> >> >> >> >> UPDATE statement. I've generally found declarative
>> >>> >> >> >> >> forms of such work to be at least one order of
>> >>> >> >> >> >> magnitude faster than going to either a PL or a script
>> >>> >> >> >> >> approach. I would start by putting together a SELECT
>> >>> >> >> >> >> query using window functions and maybe a CTE or two to
>> >>> >> >> >> >> list all the primary keys which need updating and the
>> >>> >> >> >> >> new values they should have. Once that SELECT was
>> >>> >> >> >> >> looking good, I would put it in the FROM clause of an
>> >>> >> >> >> >> UPDATE statement.
>> >>> >> >> >> >>
>> >>> >> >> >> >> That should work, but if you are updating a large
>> >>> >> >> >> >> percentage of the table, I would go one step further
>> >>> >> >> >> >> before running this against the production tables. I
>> >>> >> >> >> >> would put a LIMIT on the above-mentioned SELECT of
>> >>> >> >> >> >> something like 10000 rows, and script a loop that
>> >>> >> >> >> >> alternates between the UPDATE and a VACUUM ANALYZE on
>> >>> >> >> >> >> the table.
>> >>> >> >> >> >>
>> >>> >> >> >> >> --
>> >>> >> >> >> >> Kevin Grittner
>> >>> >> >> >> >> EDB: http://www.enterprisedb.com
>> >>> >> >> >> >> The Enterprise PostgreSQL Company
>> >>> >> >> >> >
>> >>> >> >> >> > James, you might consider dropping as many indexes on
>> >>> >> >> >> > the table as you safely can, and rebuilding them after
>> >>> >> >> >> > the mass update. If you have lots of such indexes, you
>> >>> >> >> >> > will find this approach to be a lot faster.
>> >>> >> >> >> >
>> >>> >> >> >> > Cheers,
>> >>> >> >> >> > Gavin
>> >>> >> >> >>
>> >>> >> >> >> Hi all,
>> >>> >> >> >>
>> >>> >> >> >> Thanks for your help and assistance. I think that window
>> >>> >> >> >> functions, and in particular the PARTITION function, is
>> >>> >> >> >> 100% the way to go. I've been concentrating on a SELECT
>> >>> >> >> >> statement for now and am close but not quite close enough.
>> >>> >> >> >> The below query gets all the data I want, but *too* much.
>> >>> >> >> >> What I've essentially done is:
>> >>> >> >> >>
>> >>> >> >> >> - Select all the rows that don't have any geom information
>> >>> >> >> >> - Join them with all rows before this point that *do* have
>> >>> >> >> >>   geom information.
>> >>> >> >> >> - Before doing this join, use partition to generate row
>> >>> >> >> >>   numbers.
>> >>> >> >> >>
>> >>> >> >> >> The attached screen grab shows the result of my query
>> >>> >> >> >> below. Unfortunately this is generating a lot of joins
>> >>> >> >> >> that I don't want. This won't be practical when doing it
>> >>> >> >> >> with 75,000 people.
>> >>> >> >> >>
>> >>> >> >> >> Thoughts and code suggestions very much appreciated... if
>> >>> >> >> >> needed I could put together some SQL to create an example
>> >>> >> >> >> table?
>> >>> >> >> >>
>> >>> >> >> >> Thanks
>> >>> >> >> >>
>> >>> >> >> >> SELECT row_number() OVER (PARTITION BY test.point_time
>> >>> >> >> >>         ORDER BY test.point_time) as test_row,
>> >>> >> >> >>     test.ppid as test_ppid,
>> >>> >> >> >>     test.point_time as test_point_time,
>> >>> >> >> >>     test.the_geom as test_the_geom,
>> >>> >> >> >>     a.ppid as a_ppid,
>> >>> >> >> >>     a.point_time as a_point_time,
>> >>> >> >> >>     a.the_geom as a_the_geom,
>> >>> >> >> >>     a.a_row
>> >>> >> >> >> FROM test
>> >>> >> >> >> LEFT JOIN (
>> >>> >> >> >>     SELECT the_geom,
>> >>> >> >> >>         ppid,
>> >>> >> >> >>         point_time,
>> >>> >> >> >>         row_number() OVER (ORDER BY ppid, point_time) as a_row
>> >>> >> >> >>     FROM test
>> >>> >> >> >>     WHERE the_geom IS NOT NULL) a
>> >>> >> >> >> ON a.point_time < test.point_time
>> >>> >> >> >> AND a.ppid = test.ppid
>> >>> >> >> >> WHERE test.the_geom IS NULL
>> >>> >> >> >> ORDER BY test.point_time
>> >>> >>
>> >>> >> Hi Erik / all,
>> >>> >>
>> >>> >> So I think I've managed to re-write my queries using CTEs. The
>> >>> >> below code now does get me the data that I want from this. But
>> >>> >> to do so it is going to create a frankly huge table in the bit
>> >>> >> of the SQL where it makes the table called 'partitioned'. My
>> >>> >> rough guess is that it'll have to make a table of about 100
>> >>> >> billion rows in order to get the data I need (about 108 million
>> >>> >> rows).
>> >>> >>
>> >>> >> Could someone please glance through it for me and suggest how to
>> >>> >> write it more efficiently?
>> >>> >>
>> >>> >> Thanks
>> >>> >>
>> >>> >> James
>> >>> >>
>> >>> >> WITH missing_geoms AS (
>> >>> >>     SELECT ppid,
>> >>> >>         point_time,
>> >>> >>         the_geom
>> >>> >>     FROM hybrid_location
>> >>> >>     WHERE the_geom IS NULL)
>> >>> >> -----------------
>> >>> >> ,filled_geoms AS (
>> >>> >>     SELECT ppid,
>> >>> >>         point_time,
>> >>> >>         the_geom
>> >>> >>     FROM hybrid_location
>> >>> >>     WHERE the_geom IS NOT NULL)
>> >>> >> ----------------
>> >>> >> ,partitioned AS (
>> >>> >>     SELECT missing_geoms.ppid,
>> >>> >>         missing_geoms.point_time,
>> >>> >>         missing_geoms.the_geom,
>> >>> >>         filled_geoms.ppid,
>> >>> >>         filled_geoms.point_time,
>> >>> >>         filled_geoms.the_geom,
>> >>> >>         row_number() OVER (PARTITION BY missing_geoms.ppid,
>> >>> >>                 missing_geoms.point_time
>> >>> >>             ORDER BY missing_geoms.ppid,
>> >>> >>                 missing_geoms.point_time,
>> >>> >>                 filled_geoms.ppid,
>> >>> >>                 filled_geoms.point_time DESC)
>> >>> >>     FROM missing_geoms
>> >>> >>     LEFT JOIN filled_geoms
>> >>> >>         ON filled_geoms.point_time < missing_geoms.point_time
>> >>> >>         AND filled_geoms.ppid = missing_geoms.ppid)
>> >>> >> --------------
>> >>> >> SELECT *
>> >>> >> FROM partitioned
>> >>> >> WHERE row_number = 1;
>> >>> >>
>> >>> >> James
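All the approaches traded in this thread (the NOT EXISTS cull, DENSE_RANK plus LAG, row_number over a join) are ways of carrying the last known geom forward over the NULL rows. For anyone who wants to experiment with the idea away from a Postgres install, here is a minimal sketch, not taken from the thread itself: it runs a COALESCE over a correlated `ORDER BY ... LIMIT 1` subquery (which plays the role of the NOT EXISTS cull) against a trimmed copy of James's test_data, using SQLite through Python's sqlite3 so it is runnable anywhere.

```python
import sqlite3

# In-memory database with a trimmed version of James's test_data table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE test_data (ppid INTEGER, point_time TEXT, the_geom INTEGER);
    INSERT INTO test_data VALUES
        (1, '2012-01-01 07:00', 1),
        (1, '2012-01-01 07:01', NULL),  -- should be filled with 1
        (1, '2012-01-01 07:02', 5),
        (1, '2012-01-01 07:03', NULL),  -- should be filled with 5
        (2, '2013-05-02 07:12', 24),
        (2, '2013-05-02 07:13', NULL);  -- should be filled with 24
""")

# COALESCE keeps a non-NULL geom as-is; otherwise the correlated subquery
# picks the most recent earlier non-NULL geom for the same ppid.  The
# ORDER BY ... LIMIT 1 does the work of the NOT EXISTS cull.
rows = conn.execute("""
    SELECT d.ppid, d.point_time,
           COALESCE(d.the_geom,
                    (SELECT g.the_geom
                     FROM test_data g
                     WHERE g.ppid = d.ppid
                       AND g.point_time < d.point_time
                       AND g.the_geom IS NOT NULL
                     ORDER BY g.point_time DESC
                     LIMIT 1)) AS filled_geom
    FROM test_data d
    ORDER BY d.ppid, d.point_time
""").fetchall()

for r in rows:
    print(r)
```

In Postgres proper the same SELECT can sit in the FROM clause of an `UPDATE ... FROM`, along the lines Kevin Grittner suggests above; batching with a LIMIT plus VACUUM ANALYZE applies there just as he describes.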