Обсуждение: Retrieve most recent 1 record from joined table
Hi all, This is my first question, don't hate me please if it is the wrong place. Mi question is relative seimple, but goes me crazy. First, create some example tables: ---- create table table1( t1_ID SERIAL primary key, t1_value text ); create table table2( t2_ID SERIAL primary key, t2_value text, t1_id integer ); create table table3( t3_ID SERIAL primary key, t3_value text, t3_date timestamp, t2_id integer ); ALTER TABLE table2 ADD FOREIGN KEY (t1_id) REFERENCES table1 (t1_ID) ON DELETE CASCADE; ALTER TABLE table3 ADD FOREIGN KEY (t2_id) REFERENCES table2 (t2_ID) ON DELETE CASCADE; CREATE INDEX IDX_TABLE1_T1_value ON table1 (t1_value); CREATE INDEX IDX_TABLE2_T2_value ON table2 (t2_value); CREATE INDEX IDX_TABLE3_T3_value ON table3 (t3_value); CREATE INDEX IDX_TABLE3_T3_DATE ON table3 (t3_date); ----- As you can see, table3 is referenced to table2 and table2 is referenced to table1 Next, insert some data on tables: --one millon records into table1 insert into table1(t1_value) select md5(random()::text) from (select generate_series(1,1000000) as a) as c; --100.000 records on table2 where t1_id is a random number between 1 and 10.001. This guarantees many records pointing to same table1 t1_id insert into table2(t2_value, t1_id) select md5(random()::text), trunc((random()+1)*10000)::integer from (select generate_series(1,100000) as a) as c; --again, 1.022.401 records on table3 where t2_id is a random number between 1 and 10.001. This guarantee many records pointing to same table2 t2_id. -- random numbers (1.022.401) are generated by generated_series function with timestamp (1 minute) interval insert into table3(t3_value, t2_id, t3_date) select md5(random()::text), trunc((random()+1)*10000)::integer, c.date_val from (select generate_series(timestamp '2013-01-01',timestamp '2014-12-12', interval '1 minute') as date_val) as c; So, now we should have table3 with many rows per table2(t2_id) and many rows in table2 with same t1_id. Now, the question: Joining the tables, how to get ONLY most recent record per table3(t3_date)?? Query example: select * from table1 as t1 inner join table2 t2 on (t1.t1_id = t2.t1_id and t2.t2_value like('%ab%') ) inner join table3 t3 on (t2.t2_id = t3.t2_id and t3.t3_date <= timestamp '2014-08-20') order by t3.t2_id, t3.t3_date desc produces to me this dataset (extract) t1_id t1_value t2_id t2_value t1_id t3_id t3_value t3_date t2_id 17098 74127bc80ca759678892c957b6a34fc7 10020 9182c1f48cd008e31d781abc45723a10 17098 857683 0fb6cb380522dd83b6ac4beba2c6e98f 2014-08-19 14:42:00.0 10020 17098 74127bc80ca759678892c957b6a34fc7 10020 9182c1f48cd008e31d781abc45723a10 17098 857088 9c3481bfc0bdea51e62b338a1777cde6 2014-08-19 04:47:00.0 10020 17098 74127bc80ca759678892c957b6a34fc7 10020 9182c1f48cd008e31d781abc45723a10 17098 853208 253dc2083e70dd9e276867f43889973f 2014-08-16 12:07:00.0 10020 17098 74127bc80ca759678892c957b6a34fc7 10020 9182c1f48cd008e31d781abc45723a10 17098 851237 970acf901c4232c178b5dbeda4e44ac8 2014-08-15 03:16:00.0 10020 17098 74127bc80ca759678892c957b6a34fc7 10020 9182c1f48cd008e31d781abc45723a10 17098 847436 cb8d52255eaa850f9d7f203092a2ce13 2014-08-12 11:55:00.0 10020 17098 74127bc80ca759678892c957b6a34fc7 10020 9182c1f48cd008e31d781abc45723a10 17098 819339 d9e36ad622b1db499b3f623cdd46a811 2014-07-23 23:38:00.0 10020 17098 74127bc80ca759678892c957b6a34fc7 10020 9182c1f48cd008e31d781abc45723a10 17098 818022 b9d13239f522e3a22f20d36ea6dab8ad 2014-07-23 01:41:00.0 10020 17098 74127bc80ca759678892c957b6a34fc7 10020 9182c1f48cd008e31d781abc45723a10 17098 803046 c36164f3077894a1986c4922dfb632ec 2014-07-12 16:05:00.0 10020 17098 74127bc80ca759678892c957b6a34fc7 10020 9182c1f48cd008e31d781abc45723a10 17098 788129 32c966feab2212a29f86bebbaa6dfec9 2014-07-02 07:28:00.0 10020 As you can see, there are many t3_id per single t2_id. I need the same dataset, but i want only the most recent (one) record per table3, and, of course, directly in the join rule (in production i have a long and complex query). My personal solution (very slow, i can't use it): select * from table1 as t1 inner join table2 t2 on (t1.t1_id = t2.t1_id and t2.t2_value like('%ab%') ) inner join table3 t3 on ( t3.t3_id = (select t3_id from table3 where t2_id = t2.t2_id and t3_date <= timestamp '2014-08-20' order by t3_date desc fetch first 1 rows only) and t3.t2_id = t2.t2_id ) order by t3.t2_id, t3.t3_date desc it gives me the right result, but performances are poor........ Same bad performance result with a function that performs the table3 query. Suggestions? Thanks to anyone who can answer to me!!!! Best regards, Agharta
<div dir="ltr">Try to use a "GROUP BY " clause in your select, coupled with an aggregate function (MAX in your case ?). Notethat you should not need any subquery, so a single select with inner join should be OK.<br /></div><div class="gmail_extra"><br/><br /><div class="gmail_quote">2014-08-22 10:05 GMT+02:00 agharta <span dir="ltr"><<a href="mailto:agharta82@gmail.com"target="_blank">agharta82@gmail.com</a>></span>:<br /><blockquote class="gmail_quote"style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> Hi all,<br /> This is my firstquestion, don't hate me please if it is the wrong place.<br /><br /> Mi question is relative seimple, but goes me crazy.<br/><br /> First, create some example tables:<br /><br /> ----<br /> create table table1(<br /> t1_ID SERIAL primarykey,<br /> t1_value text<br /> );<br /><br /> create table table2(<br /> t2_ID SERIAL primary key,<br /> t2_value text,<br /> t1_id integer<br /> );<br /><br /><br /> create table table3(<br /> t3_ID SERIAL primarykey,<br /> t3_value text,<br /> t3_date timestamp,<br /> t2_id integer<br /> );<br /><br /> ALTER TABLEtable2 ADD FOREIGN KEY (t1_id) REFERENCES table1 (t1_ID) ON DELETE CASCADE;<br /> ALTER TABLE table3 ADD FOREIGN KEY(t2_id) REFERENCES table2 (t2_ID) ON DELETE CASCADE;<br /><br /> CREATE INDEX IDX_TABLE1_T1_value ON table1 (t1_value);<br/> CREATE INDEX IDX_TABLE2_T2_value ON table2 (t2_value);<br /> CREATE INDEX IDX_TABLE3_T3_value ON table3(t3_value);<br /> CREATE INDEX IDX_TABLE3_T3_DATE ON table3 (t3_date);<br /> -----<br /><br /> As you can see, table3is referenced to table2 and table2 is referenced to table1<br /><br /> Next, insert some data on tables:<br /><br />--one millon records into table1<br /> insert into table1(t1_value) select md5(random()::text) from (select generate_series(1,1000000)as a) as c;<br /><br /> --100.000 records on table2 where t1_id is a random number between 1 and10.001. This guarantees many records pointing to same table1 t1_id<br /> insert into table2(t2_value, t1_id) select md5(random()::text),trunc((random()+1)*10000)::<u></u>integer from (select generate_series(1,100000) as a) as c;<br /><br/> --again, 1.022.401 records on table3 where t2_id is a random number between 1 and 10.001. This guarantee many recordspointing to same table2 t2_id.<br /> -- random numbers (1.022.401) are generated by generated_series function withtimestamp (1 minute) interval<br /> insert into table3(t3_value, t2_id, t3_date) select md5(random()::text), trunc((random()+1)*10000)::<u></u>integer,c.date_val from (select generate_series(timestamp '2013-01-01',timestamp '2014-12-12',interval '1 minute') as date_val) as c;<br /><br /><br /> So, now we should have table3 with many rows per table2(t2_id)and many rows in table2 with same t1_id.<br /><br /><br /> Now, the question:<br /><br /><br /> Joining thetables, how to get ONLY most recent record per table3(t3_date)??<br /><br /> Query example:<br /><br /> select * fromtable1 as t1<br /> inner join table2 t2 on (t1.t1_id = t2.t1_id and t2.t2_value like('%ab%') )<br /> inner join table3t3 on (t2.t2_id = t3.t2_id and t3.t3_date <= timestamp '2014-08-20')<br /> order by t3.t2_id, t3.t3_date desc<br/><br /> produces to me this dataset (extract)<br /><br /><br /> t1_id t1_value t2_id t2_value t1_id t3_id t3_value t3_date t2_id<br /> 17098 74127bc80ca759678892c957b6a34f<u></u>c7 10020 9182c1f48cd008e31d781abc45723a<u></u>10 17098 857683 0fb6cb380522dd83b6ac4beba2c6e9<u></u>8f 2014-08-19 14:42:00.0 10020<br /> 17098 74127bc80ca759678892c957b6a34f<u></u>c7 10020 9182c1f48cd008e31d781abc45723a<u></u>10 17098 857088 9c3481bfc0bdea51e62b338a1777cd<u></u>e6 2014-08-19 04:47:00.0 10020<br /> 17098 74127bc80ca759678892c957b6a34f<u></u>c7 10020 9182c1f48cd008e31d781abc45723a<u></u>10 17098 853208 253dc2083e70dd9e276867f4388997<u></u>3f 2014-08-16 12:07:00.0 10020<br /> 17098 74127bc80ca759678892c957b6a34f<u></u>c7 10020 9182c1f48cd008e31d781abc45723a<u></u>10 17098 851237 970acf901c4232c178b5dbeda4e44a<u></u>c8 2014-08-15 03:16:00.0 10020<br /> 17098 74127bc80ca759678892c957b6a34f<u></u>c7 10020 9182c1f48cd008e31d781abc45723a<u></u>10 17098 847436 cb8d52255eaa850f9d7f203092a2ce<u></u>13 2014-08-12 11:55:00.0 10020<br /> 17098 74127bc80ca759678892c957b6a34f<u></u>c7 10020 9182c1f48cd008e31d781abc45723a<u></u>10 17098 819339 d9e36ad622b1db499b3f623cdd46a8<u></u>11 2014-07-23 23:38:00.0 10020<br /> 17098 74127bc80ca759678892c957b6a34f<u></u>c7 10020 9182c1f48cd008e31d781abc45723a<u></u>10 17098 818022 b9d13239f522e3a22f20d36ea6dab8<u></u>ad 2014-07-23 01:41:00.0 10020<br /> 17098 74127bc80ca759678892c957b6a34f<u></u>c7 10020 9182c1f48cd008e31d781abc45723a<u></u>10 17098 803046 c36164f3077894a1986c4922dfb632<u></u>ec 2014-07-12 16:05:00.0 10020<br /> 17098 74127bc80ca759678892c957b6a34f<u></u>c7 10020 9182c1f48cd008e31d781abc45723a<u></u>10 17098 788129 32c966feab2212a29f86bebbaa6dfe<u></u>c9 2014-07-02 07:28:00.0 10020<br /><br /><br /> As you can see, there are manyt3_id per single t2_id.<br /><br /> I need the same dataset, but i want only the most recent (one) record per table3,and, of course, directly in the join rule (in production i have a long and complex query).<br /><br /> My personalsolution (very slow, i can't use it):<br /><br /> select * from table1 as t1<br /> inner join table2 t2 on (t1.t1_id= t2.t1_id and t2.t2_value like('%ab%') )<br /> inner join table3 t3 on (<br /> t3.t3_id = (select t3_id from table3where t2_id = t2.t2_id and t3_date <= timestamp '2014-08-20' order by t3_date desc fetch first 1 rows only)<br />and t3.t2_id = t2.t2_id<br /> )<br /> order by t3.t2_id, t3.t3_date desc<br /><br /><br /> it gives me the right result,but performances are poor........<br /><br /> Same bad performance result with a function that performs the table3query.<br /><br /> Suggestions?<br /><br /> Thanks to anyone who can answer to me!!!!<br /><br /> Best regards,<br/><br /> Agharta<span class="HOEnZb"><font color="#888888"><br /><br /><br /><br /><br /><br /><br /><br /><br/><br /><br /><br /> -- <br /> Sent via pgsql-sql mailing list (<a href="mailto:pgsql-sql@postgresql.org" target="_blank">pgsql-sql@postgresql.org</a>)<br/> To make changes to your subscription:<br /><a href="http://www.postgresql.org/mailpref/pgsql-sql" target="_blank">http://www.postgresql.org/<u></u>mailpref/pgsql-sql</a><br/></font></span></blockquote></div><br /></div>
<div class="moz-cite-prefix">Hi André,<br /> Thanks a lot for you quickly reply!<br /><br /> Your solution is right, butI'm sorry because i cannot apply a group by with max in my production query: it does other group by and sums.<br /><br/> I've try another solution that gives me a better performances:<br /><br /> WITH filtered_data AS (<br /> select * from (<br /> SELECT *, rank() OVER (PARTITION BY t2_id ORDER BY t3_date DESC) as rank FROM table3<br/> where t3_date <= timestamp '2014-08-20'<br /> ) xx where xx.rank =1<br /> )<br /> select* from table1 as t1<br /> inner join table2 t2 on (t1.t1_id = t2.t1_id and t2.t2_value like('%ab%') )<br /> inner joinfiltered_data t3 on (t3.t2_id = t2.t2_id)<br /> order by t3.t2_id, t3.t3_date desc<br /><br /><br /> So, can it be theright way to solve my problem? There are some side-effects? Could exist a better solution, aka "1.000.000$ question";-) ? <br /><br /><br /> Thanks again for your support!<br /><br /> Best regards,<br /><br /> Agharta<br /><br/><br /><br /> On 08/22/2014 10:13 AM, Brice André wrote:<br /></div><blockquote cite="mid:CAOBG12kfHhS=F-8p1SmpYd+2efDPBY2zUO3TDDEPQrbR-YRQ7w@mail.gmail.com"type="cite"><div dir="ltr">Try to use a "GROUPBY " clause in your select, coupled with an aggregate function (MAX in your case ?). Note that you should not needany subquery, so a single select with inner join should be OK.<br /></div><div class="gmail_extra"><br /><br /><div class="gmail_quote">2014-08-2210:05 GMT+02:00 agharta <span dir="ltr"><<a href="mailto:agharta82@gmail.com" moz-do-not-send="true"target="_blank">agharta82@gmail.com</a>></span>:<br /><blockquote class="gmail_quote" style="margin:00 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> Hi all,<br /> This is my first question,don't hate me please if it is the wrong place.<br /><br /> Mi question is relative seimple, but goes me crazy.<br/><br /> First, create some example tables:<br /><br /> ----<br /> create table table1(<br /> t1_ID SERIAL primarykey,<br /> t1_value text<br /> );<br /><br /> create table table2(<br /> t2_ID SERIAL primary key,<br /> t2_value text,<br /> t1_id integer<br /> );<br /><br /><br /> create table table3(<br /> t3_ID SERIAL primarykey,<br /> t3_value text,<br /> t3_date timestamp,<br /> t2_id integer<br /> );<br /><br /> ALTER TABLEtable2 ADD FOREIGN KEY (t1_id) REFERENCES table1 (t1_ID) ON DELETE CASCADE;<br /> ALTER TABLE table3 ADD FOREIGN KEY(t2_id) REFERENCES table2 (t2_ID) ON DELETE CASCADE;<br /><br /> CREATE INDEX IDX_TABLE1_T1_value ON table1 (t1_value);<br/> CREATE INDEX IDX_TABLE2_T2_value ON table2 (t2_value);<br /> CREATE INDEX IDX_TABLE3_T3_value ON table3(t3_value);<br /> CREATE INDEX IDX_TABLE3_T3_DATE ON table3 (t3_date);<br /> -----<br /><br /> As you can see, table3is referenced to table2 and table2 is referenced to table1<br /><br /> Next, insert some data on tables:<br /><br />--one millon records into table1<br /> insert into table1(t1_value) select md5(random()::text) from (select generate_series(1,1000000)as a) as c;<br /><br /> --100.000 records on table2 where t1_id is a random number between 1 and10.001. This guarantees many records pointing to same table1 t1_id<br /> insert into table2(t2_value, t1_id) select md5(random()::text),trunc((random()+1)*10000)::integer from (select generate_series(1,100000) as a) as c;<br /><br /> --again,1.022.401 records on table3 where t2_id is a random number between 1 and 10.001. This guarantee many records pointingto same table2 t2_id.<br /> -- random numbers (1.022.401) are generated by generated_series function with timestamp(1 minute) interval<br /> insert into table3(t3_value, t2_id, t3_date) select md5(random()::text), trunc((random()+1)*10000)::integer,c.date_val from (select generate_series(timestamp '2013-01-01',timestamp '2014-12-12',interval '1 minute') as date_val) as c;<br /><br /><br /> So, now we should have table3 with many rows per table2(t2_id)and many rows in table2 with same t1_id.<br /><br /><br /> Now, the question:<br /><br /><br /> Joining thetables, how to get ONLY most recent record per table3(t3_date)??<br /><br /> Query example:<br /><br /> select * fromtable1 as t1<br /> inner join table2 t2 on (t1.t1_id = t2.t1_id and t2.t2_value like('%ab%') )<br /> inner join table3t3 on (t2.t2_id = t3.t2_id and t3.t3_date <= timestamp '2014-08-20')<br /> order by t3.t2_id, t3.t3_date desc<br/><br /> produces to me this dataset (extract)<br /><br /><br /> t1_id t1_value t2_id t2_value t1_id t3_id t3_value t3_date t2_id<br /> 17098 74127bc80ca759678892c957b6a34fc7 10020 9182c1f48cd008e31d781abc45723a10 17098 857683 0fb6cb380522dd83b6ac4beba2c6e98f 2014-08-19 14:42:00.0 10020<br /> 17098 74127bc80ca759678892c957b6a34fc7 100209182c1f48cd008e31d781abc45723a10 17098 857088 9c3481bfc0bdea51e62b338a1777cde6 2014-08-19 04:47:00.0 10020<br/> 17098 74127bc80ca759678892c957b6a34fc7 10020 9182c1f48cd008e31d781abc45723a10 17098 853208 253dc2083e70dd9e276867f43889973f 2014-08-16 12:07:00.0 10020<br /> 17098 74127bc80ca759678892c957b6a34fc7 100209182c1f48cd008e31d781abc45723a10 17098 851237 970acf901c4232c178b5dbeda4e44ac8 2014-08-15 03:16:00.0 10020<br/> 17098 74127bc80ca759678892c957b6a34fc7 10020 9182c1f48cd008e31d781abc45723a10 17098 847436 cb8d52255eaa850f9d7f203092a2ce13 2014-08-12 11:55:00.0 10020<br /> 17098 74127bc80ca759678892c957b6a34fc7 100209182c1f48cd008e31d781abc45723a10 17098 819339 d9e36ad622b1db499b3f623cdd46a811 2014-07-23 23:38:00.0 10020<br/> 17098 74127bc80ca759678892c957b6a34fc7 10020 9182c1f48cd008e31d781abc45723a10 17098 818022 b9d13239f522e3a22f20d36ea6dab8ad 2014-07-23 01:41:00.0 10020<br /> 17098 74127bc80ca759678892c957b6a34fc7 100209182c1f48cd008e31d781abc45723a10 17098 803046 c36164f3077894a1986c4922dfb632ec 2014-07-12 16:05:00.0 10020<br/> 17098 74127bc80ca759678892c957b6a34fc7 10020 9182c1f48cd008e31d781abc45723a10 17098 788129 32c966feab2212a29f86bebbaa6dfec9 2014-07-02 07:28:00.0 10020<br /><br /><br /> As you can see, there are many t3_id per single t2_id.<br /><br /> I need the same dataset, but i want only the most recent (one) record per table3, and,of course, directly in the join rule (in production i have a long and complex query).<br /><br /> My personal solution(very slow, i can't use it):<br /><br /> select * from table1 as t1<br /> inner join table2 t2 on (t1.t1_id = t2.t1_idand t2.t2_value like('%ab%') )<br /> inner join table3 t3 on (<br /> t3.t3_id = (select t3_id from table3 where t2_id= t2.t2_id and t3_date <= timestamp '2014-08-20' order by t3_date desc fetch first 1 rows only)<br /> and t3.t2_id= t2.t2_id<br /> )<br /> order by t3.t2_id, t3.t3_date desc<br /><br /><br /> it gives me the right result, but performancesare poor........<br /><br /> Same bad performance result with a function that performs the table3 query.<br /><br/> Suggestions?<br /><br /> Thanks to anyone who can answer to me!!!!<br /><br /> Best regards,<br /><br /> Agharta<spanclass="HOEnZb"><font color="#888888"><br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><br />-- <br /> Sent via pgsql-sql mailing list (<a href="mailto:pgsql-sql@postgresql.org" moz-do-not-send="true" target="_blank">pgsql-sql@postgresql.org</a>)<br/> To make changes to your subscription:<br /><a href="http://www.postgresql.org/mailpref/pgsql-sql"moz-do-not-send="true" target="_blank">http://www.postgresql.org/mailpref/pgsql-sql</a><br/></font></span></blockquote></div><br /></div></blockquote><br/>
On 08/22/2014 10:05 AM, agharta wrote: > Hi all, > This is my first question, don't hate me please if it is the wrong place. This is the right place. See below for my answer. > Mi question is relative seimple, but goes me crazy. > > First, create some example tables: > > ---- > create table table1( > t1_ID SERIAL primary key, > t1_value text > ); > > create table table2( > t2_ID SERIAL primary key, > t2_value text, > t1_id integer > ); > > > create table table3( > t3_ID SERIAL primary key, > t3_value text, > t3_date timestamp, > t2_id integer > ); > > ALTER TABLE table2 ADD FOREIGN KEY (t1_id) REFERENCES table1 (t1_ID) ON > DELETE CASCADE; > ALTER TABLE table3 ADD FOREIGN KEY (t2_id) REFERENCES table2 (t2_ID) ON > DELETE CASCADE; > > CREATE INDEX IDX_TABLE1_T1_value ON table1 (t1_value); > CREATE INDEX IDX_TABLE2_T2_value ON table2 (t2_value); > CREATE INDEX IDX_TABLE3_T3_value ON table3 (t3_value); > CREATE INDEX IDX_TABLE3_T3_DATE ON table3 (t3_date); > ----- > > As you can see, table3 is referenced to table2 and table2 is referenced > to table1 > > Next, insert some data on tables: > > --one millon records into table1 > insert into table1(t1_value) select md5(random()::text) from (select > generate_series(1,1000000) as a) as c; > > --100.000 records on table2 where t1_id is a random number between 1 and > 10.001. This guarantees many records pointing to same table1 t1_id > insert into table2(t2_value, t1_id) select md5(random()::text), > trunc((random()+1)*10000)::integer from (select > generate_series(1,100000) as a) as c; > > --again, 1.022.401 records on table3 where t2_id is a random number > between 1 and 10.001. This guarantee many records pointing to same > table2 t2_id. > -- random numbers (1.022.401) are generated by generated_series function > with timestamp (1 minute) interval > insert into table3(t3_value, t2_id, t3_date) select md5(random()::text), > trunc((random()+1)*10000)::integer, c.date_val from (select > generate_series(timestamp '2013-01-01',timestamp '2014-12-12', interval > '1 minute') as date_val) as c; > > > So, now we should have table3 with many rows per table2(t2_id) and many > rows in table2 with same t1_id. > > > Now, the question: > > > Joining the tables, how to get ONLY most recent record per > table3(t3_date)?? > > Query example: > > select * from table1 as t1 > inner join table2 t2 on (t1.t1_id = t2.t1_id and t2.t2_value like('%ab%') ) > inner join table3 t3 on (t2.t2_id = t3.t2_id and t3.t3_date <= timestamp > '2014-08-20') > order by t3.t2_id, t3.t3_date desc > > produces to me this dataset (extract) > > > t1_id t1_value t2_id t2_value t1_id > t3_id t3_value t3_date > t2_id > 17098 74127bc80ca759678892c957b6a34fc7 10020 > 9182c1f48cd008e31d781abc45723a10 17098 857683 > 0fb6cb380522dd83b6ac4beba2c6e98f 2014-08-19 14:42:00.0 10020 > 17098 74127bc80ca759678892c957b6a34fc7 10020 > 9182c1f48cd008e31d781abc45723a10 17098 857088 > 9c3481bfc0bdea51e62b338a1777cde6 2014-08-19 04:47:00.0 10020 > 17098 74127bc80ca759678892c957b6a34fc7 10020 > 9182c1f48cd008e31d781abc45723a10 17098 853208 > 253dc2083e70dd9e276867f43889973f 2014-08-16 12:07:00.0 10020 > 17098 74127bc80ca759678892c957b6a34fc7 10020 > 9182c1f48cd008e31d781abc45723a10 17098 851237 > 970acf901c4232c178b5dbeda4e44ac8 2014-08-15 03:16:00.0 10020 > 17098 74127bc80ca759678892c957b6a34fc7 10020 > 9182c1f48cd008e31d781abc45723a10 17098 847436 > cb8d52255eaa850f9d7f203092a2ce13 2014-08-12 11:55:00.0 10020 > 17098 74127bc80ca759678892c957b6a34fc7 10020 > 9182c1f48cd008e31d781abc45723a10 17098 819339 > d9e36ad622b1db499b3f623cdd46a811 2014-07-23 23:38:00.0 10020 > 17098 74127bc80ca759678892c957b6a34fc7 10020 > 9182c1f48cd008e31d781abc45723a10 17098 818022 > b9d13239f522e3a22f20d36ea6dab8ad 2014-07-23 01:41:00.0 10020 > 17098 74127bc80ca759678892c957b6a34fc7 10020 > 9182c1f48cd008e31d781abc45723a10 17098 803046 > c36164f3077894a1986c4922dfb632ec 2014-07-12 16:05:00.0 10020 > 17098 74127bc80ca759678892c957b6a34fc7 10020 > 9182c1f48cd008e31d781abc45723a10 17098 788129 > 32c966feab2212a29f86bebbaa6dfec9 2014-07-02 07:28:00.0 10020 > > > As you can see, there are many t3_id per single t2_id. > > I need the same dataset, but i want only the most recent (one) record > per table3, and, of course, directly in the join rule (in production i > have a long and complex query). > > My personal solution (very slow, i can't use it): > > select * from table1 as t1 > inner join table2 t2 on (t1.t1_id = t2.t1_id and t2.t2_value like('%ab%') ) > inner join table3 t3 on ( > t3.t3_id = (select t3_id from table3 where t2_id = t2.t2_id and t3_date > <= timestamp '2014-08-20' order by t3_date desc fetch first 1 rows only) > and t3.t2_id = t2.t2_id > ) > order by t3.t2_id, t3.t3_date desc > > > it gives me the right result, but performances are poor........ > > Same bad performance result with a function that performs the table3 query. > > Suggestions? > > Thanks to anyone who can answer to me!!!! You didn't say what version of PostgreSQL you're using. The following solution works with 9.3. select * from table1 as t1 join table2 as t2 on t2.t1_id = t1.t1_id and t2.t2_value like '%ab%' join lateral (select * from table3 where t2_id = t2.t2_id and t3_date <= '2014-08-20' order by t3_date desc limit 1) as t3 on true order by t3.t2_id, t3.t3_date desc; You will need the following indexes for it to give good performance: create index on t2 using gist (t2_value gist_trgm_ops); create index on t3 using btree (t2_id, t3_date desc); That first index will require the pg_trgm extension that you can get with: create extension pg_trgm; If you are not using 9.3, I can give you a (much) more complicated version that will perform well, but ideally you'll want to use the above. -- Vik
agharta schrieb am 22.08.2014 um 10:05: > Joining the tables, how to get ONLY most recent record per table3(t3_date)?? > > Query example: > > select * from table1 as t1 > inner join table2 t2 on (t1.t1_id = t2.t1_id and t2.t2_value like('%ab%') ) > inner join table3 t3 on (t2.t2_id = t3.t2_id and t3.t3_date <= timestamp '2014-08-20') > order by t3.t2_id, t3.t3_date desc > This seems to be slightly faster, especially with the following index: create index idx_t3_combined on table3 (t2_id, t3_date desc, t3_id); select * from table1 as t1 join table2 t2 on t1.t1_id = t2.t1_id and t2.t2_value like '%ab%' join ( select distinct on (t2_id)t3_id, t3_date, t2_id from table3 order by t2_id, t3_date desc ) t3 on t3.t2_id = t2.t2_id order by t3.t2_id, t3.t3_date desc ; I also had to increase the work_mem in order to avoid disk based sorting for the joins
On 08/23/2014 02:15 PM, Vik Fearing wrote: > On 08/22/2014 10:05 AM, agharta wrote: >> Hi all, >> This is my first question, don't hate me please if it is the wrong place. > This is the right place. See below for my answer. > > >> Mi question is relative seimple, but goes me crazy. >> >> First, create some example tables: >> >> ---- >> create table table1( >> t1_ID SERIAL primary key, >> t1_value text >> ); >> >> create table table2( >> t2_ID SERIAL primary key, >> t2_value text, >> t1_id integer >> ); >> >> >> create table table3( >> t3_ID SERIAL primary key, >> t3_value text, >> t3_date timestamp, >> t2_id integer >> ); >> >> ALTER TABLE table2 ADD FOREIGN KEY (t1_id) REFERENCES table1 (t1_ID) ON >> DELETE CASCADE; >> ALTER TABLE table3 ADD FOREIGN KEY (t2_id) REFERENCES table2 (t2_ID) ON >> DELETE CASCADE; >> >> CREATE INDEX IDX_TABLE1_T1_value ON table1 (t1_value); >> CREATE INDEX IDX_TABLE2_T2_value ON table2 (t2_value); >> CREATE INDEX IDX_TABLE3_T3_value ON table3 (t3_value); >> CREATE INDEX IDX_TABLE3_T3_DATE ON table3 (t3_date); >> ----- >> >> As you can see, table3 is referenced to table2 and table2 is referenced >> to table1 >> >> Next, insert some data on tables: >> >> --one millon records into table1 >> insert into table1(t1_value) select md5(random()::text) from (select >> generate_series(1,1000000) as a) as c; >> >> --100.000 records on table2 where t1_id is a random number between 1 and >> 10.001. This guarantees many records pointing to same table1 t1_id >> insert into table2(t2_value, t1_id) select md5(random()::text), >> trunc((random()+1)*10000)::integer from (select >> generate_series(1,100000) as a) as c; >> >> --again, 1.022.401 records on table3 where t2_id is a random number >> between 1 and 10.001. This guarantee many records pointing to same >> table2 t2_id. >> -- random numbers (1.022.401) are generated by generated_series function >> with timestamp (1 minute) interval >> insert into table3(t3_value, t2_id, t3_date) select md5(random()::text), >> trunc((random()+1)*10000)::integer, c.date_val from (select >> generate_series(timestamp '2013-01-01',timestamp '2014-12-12', interval >> '1 minute') as date_val) as c; >> >> >> So, now we should have table3 with many rows per table2(t2_id) and many >> rows in table2 with same t1_id. >> >> >> Now, the question: >> >> >> Joining the tables, how to get ONLY most recent record per >> table3(t3_date)?? >> >> Query example: >> >> select * from table1 as t1 >> inner join table2 t2 on (t1.t1_id = t2.t1_id and t2.t2_value like('%ab%') ) >> inner join table3 t3 on (t2.t2_id = t3.t2_id and t3.t3_date <= timestamp >> '2014-08-20') >> order by t3.t2_id, t3.t3_date desc >> >> produces to me this dataset (extract) >> >> >> t1_id t1_value t2_id t2_value t1_id >> t3_id t3_value t3_date >> t2_id >> 17098 74127bc80ca759678892c957b6a34fc7 10020 >> 9182c1f48cd008e31d781abc45723a10 17098 857683 >> 0fb6cb380522dd83b6ac4beba2c6e98f 2014-08-19 14:42:00.0 10020 >> 17098 74127bc80ca759678892c957b6a34fc7 10020 >> 9182c1f48cd008e31d781abc45723a10 17098 857088 >> 9c3481bfc0bdea51e62b338a1777cde6 2014-08-19 04:47:00.0 10020 >> 17098 74127bc80ca759678892c957b6a34fc7 10020 >> 9182c1f48cd008e31d781abc45723a10 17098 853208 >> 253dc2083e70dd9e276867f43889973f 2014-08-16 12:07:00.0 10020 >> 17098 74127bc80ca759678892c957b6a34fc7 10020 >> 9182c1f48cd008e31d781abc45723a10 17098 851237 >> 970acf901c4232c178b5dbeda4e44ac8 2014-08-15 03:16:00.0 10020 >> 17098 74127bc80ca759678892c957b6a34fc7 10020 >> 9182c1f48cd008e31d781abc45723a10 17098 847436 >> cb8d52255eaa850f9d7f203092a2ce13 2014-08-12 11:55:00.0 10020 >> 17098 74127bc80ca759678892c957b6a34fc7 10020 >> 9182c1f48cd008e31d781abc45723a10 17098 819339 >> d9e36ad622b1db499b3f623cdd46a811 2014-07-23 23:38:00.0 10020 >> 17098 74127bc80ca759678892c957b6a34fc7 10020 >> 9182c1f48cd008e31d781abc45723a10 17098 818022 >> b9d13239f522e3a22f20d36ea6dab8ad 2014-07-23 01:41:00.0 10020 >> 17098 74127bc80ca759678892c957b6a34fc7 10020 >> 9182c1f48cd008e31d781abc45723a10 17098 803046 >> c36164f3077894a1986c4922dfb632ec 2014-07-12 16:05:00.0 10020 >> 17098 74127bc80ca759678892c957b6a34fc7 10020 >> 9182c1f48cd008e31d781abc45723a10 17098 788129 >> 32c966feab2212a29f86bebbaa6dfec9 2014-07-02 07:28:00.0 10020 >> >> >> As you can see, there are many t3_id per single t2_id. >> >> I need the same dataset, but i want only the most recent (one) record >> per table3, and, of course, directly in the join rule (in production i >> have a long and complex query). >> >> My personal solution (very slow, i can't use it): >> >> select * from table1 as t1 >> inner join table2 t2 on (t1.t1_id = t2.t1_id and t2.t2_value like('%ab%') ) >> inner join table3 t3 on ( >> t3.t3_id = (select t3_id from table3 where t2_id = t2.t2_id and t3_date >> <= timestamp '2014-08-20' order by t3_date desc fetch first 1 rows only) >> and t3.t2_id = t2.t2_id >> ) >> order by t3.t2_id, t3.t3_date desc >> >> >> it gives me the right result, but performances are poor........ >> >> Same bad performance result with a function that performs the table3 query. >> >> Suggestions? >> >> Thanks to anyone who can answer to me!!!! > You didn't say what version of PostgreSQL you're using. The following > solution works with 9.3. > > select * > from table1 as t1 > join table2 as t2 on t2.t1_id = t1.t1_id and t2.t2_value like '%ab%' > join lateral (select * > from table3 > where t2_id = t2.t2_id > and t3_date <= '2014-08-20' > order by t3_date desc > limit 1) as t3 on true > order by t3.t2_id, t3.t3_date desc; > > You will need the following indexes for it to give good performance: > > create index on t2 using gist (t2_value gist_trgm_ops); > create index on t3 using btree (t2_id, t3_date desc); > > That first index will require the pg_trgm extension that you can get with: > > create extension pg_trgm; > > > If you are not using 9.3, I can give you a (much) more complicated > version that will perform well, but ideally you'll want to use the above. Hi Vik, Sorry for late, i was testing your solution and.....what that performance!!! I'm using 9.3, i've don't say it, my mistake. Using join lateral with a lot of analize command to see where put the right index, performance becomes amazing!! Thanks again for the priceless suggestion!!! .....join lateral.... what a beautiful keywords :-) Cheers, Agharta