Обсуждение: Efficiently query for the most recent record for a given user
Let's say I have a table something like this:
create table call_activity (
id int8 not null,
called timestamp,
user_id int8 not null,
primary key (id)
foreign key (user_id) references my_users
)
I want to get the last call_activity record for a single user. Is there ANY way to efficiently retrieve the last record for a specified user_id, or do I need to de-normalize and update a table with a single row for each user each time a new call_activity record is inserted? I know I how to do the query without the summary table (subquery or GROUP BY with MAX) but that seems like it will never perform well for large data sets. Or am I full of beans and it should perform just fine for a huge data set as long as I have an index on "called"?
Thanks in advance!
On Wed, Aug 7, 2013 at 3:12 PM, Robert DiFalco <robert.difalco@gmail.com> wrote: > Let's say I have a table something like this: > > create table call_activity ( > id int8 not null, > called timestamp, > user_id int8 not null, > primary key (id) > foreign key (user_id) references my_users > ) > > > I want to get the last call_activity record for a single user. Is there ANY > way to efficiently retrieve the last record for a specified user_id, or do I > need to de-normalize and update a table with a single row for each user each > time a new call_activity record is inserted? I know I how to do the query > without the summary table (subquery or GROUP BY with MAX) but that seems > like it will never perform well for large data sets. Or am I full of beans > and it should perform just fine for a huge data set as long as I have an > index on "called"? Create an index over (user_id, called desc), and do select * from call_activity where user_id = blarg order by called desc limit 1
Claudio Freire <klaussfreire@gmail.com> writes: > On Wed, Aug 7, 2013 at 3:12 PM, Robert DiFalco <robert.difalco@gmail.com> wrote: >> I want to get the last call_activity record for a single user. > Create an index over (user_id, called desc), and do > select * from call_activity where user_id = blarg order by called desc limit 1 Note that there's no particular need to specify "desc" in the index definition. This same index can support searches in either direction on the "called" column. regards, tom lane
> -----Original Message----- > From: pgsql-performance-owner@postgresql.org [mailto:pgsql- > performance-owner@postgresql.org] On Behalf Of Claudio Freire > Sent: Wednesday, August 07, 2013 2:20 PM > To: Robert DiFalco > Cc: pgsql-performance@postgresql.org > Subject: Re: [PERFORM] Efficiently query for the most recent record for a > given user > > On Wed, Aug 7, 2013 at 3:12 PM, Robert DiFalco <robert.difalco@gmail.com> > wrote: > > Let's say I have a table something like this: > > > > create table call_activity ( > > id int8 not null, > > called timestamp, > > user_id int8 not null, > > primary key (id) > > foreign key (user_id) references my_users > > ) > > > > > > I want to get the last call_activity record for a single user. Is > > there ANY way to efficiently retrieve the last record for a specified > > user_id, or do I need to de-normalize and update a table with a single > > row for each user each time a new call_activity record is inserted? I > > know I how to do the query without the summary table (subquery or > > GROUP BY with MAX) but that seems like it will never perform well for > > large data sets. Or am I full of beans and it should perform just fine > > for a huge data set as long as I have an index on "called"? > > > Create an index over (user_id, called desc), and do > > select * from call_activity where user_id = blarg order by called desc limit 1 > And most recent call for every user: SELECT id, user_id, MAX(called) OVER (PARTITION BY user_id) FROM call_activity; Regards, Igor Neyman
Thanks guys!
On Wed, Aug 7, 2013 at 11:35 AM, Igor Neyman <ineyman@perceptron.com> wrote:
> -----Original Message-----
> From: pgsql-performance-owner@postgresql.org [mailto:pgsql-
> performance-owner@postgresql.org] On Behalf Of Claudio Freire
> Sent: Wednesday, August 07, 2013 2:20 PM
> To: Robert DiFalco
> Cc: pgsql-performance@postgresql.org
> Subject: Re: [PERFORM] Efficiently query for the most recent record for a
> given user
>
> On Wed, Aug 7, 2013 at 3:12 PM, Robert DiFalco <robert.difalco@gmail.com>
> wrote:
> > Let's say I have a table something like this:
> >
> > create table call_activity (
> > id int8 not null,
> > called timestamp,
> > user_id int8 not null,
> > primary key (id)
> > foreign key (user_id) references my_users
> > )
> >
> >
> > I want to get the last call_activity record for a single user. Is
> > there ANY way to efficiently retrieve the last record for a specified
> > user_id, or do I need to de-normalize and update a table with a single
> > row for each user each time a new call_activity record is inserted? I
> > know I how to do the query without the summary table (subquery or
> > GROUP BY with MAX) but that seems like it will never perform well for
> > large data sets. Or am I full of beans and it should perform just fine
> > for a huge data set as long as I have an index on "called"?
>
>
> Create an index over (user_id, called desc), and do
>
> select * from call_activity where user_id = blarg order by called desc limit 1
>
And most recent call for every user:
SELECT id, user_id, MAX(called) OVER (PARTITION BY user_id) FROM call_activity;
Regards,
Igor Neyman
On Wed, Aug 7, 2013 at 3:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Claudio Freire <klaussfreire@gmail.com> writes: >> On Wed, Aug 7, 2013 at 3:12 PM, Robert DiFalco <robert.difalco@gmail.com> wrote: >>> I want to get the last call_activity record for a single user. > >> Create an index over (user_id, called desc), and do >> select * from call_activity where user_id = blarg order by called desc limit 1 > > Note that there's no particular need to specify "desc" in the index > definition. This same index can support searches in either direction > on the "called" column. Yeah, but it's faster if it's in the same direction, because the kernel read-ahead code detects sequential reads, whereas it doesn't when it goes backwards. The difference can be up to a factor of 10 for long index scans. Though... true... for a limit 1... it wouldn't matter that much. But it's become habit to match index sort order by now.
Claudio Freire <klaussfreire@gmail.com> writes: > On Wed, Aug 7, 2013 at 3:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Note that there's no particular need to specify "desc" in the index >> definition. This same index can support searches in either direction >> on the "called" column. > Yeah, but it's faster if it's in the same direction, because the > kernel read-ahead code detects sequential reads, whereas it doesn't > when it goes backwards. The difference can be up to a factor of 10 for > long index scans. Color me skeptical. Index searches are seldom purely sequential block accesses. Maybe if you had a freshly built index that'd never yet suffered any inserts/updates, but in practice any advantage would disappear very quickly after a few index page splits. > Though... true... for a limit 1... it wouldn't matter that much. That's the other point. regards, tom lane
Claudio Freire escribió: > On Wed, Aug 7, 2013 at 3:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Note that there's no particular need to specify "desc" in the index > > definition. This same index can support searches in either direction > > on the "called" column. > > Yeah, but it's faster if it's in the same direction, because the > kernel read-ahead code detects sequential reads, whereas it doesn't > when it goes backwards. The difference can be up to a factor of 10 for > long index scans. That might be true when an index is new, but as it grows, the leaf pages are not going to be sequential anymore. And this doesn't much apply for an equality lookup anyway, does it? -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Aug 7, 2013 at 4:04 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Yeah, but it's faster if it's in the same direction, because the >> kernel read-ahead code detects sequential reads, whereas it doesn't >> when it goes backwards. The difference can be up to a factor of 10 for >> long index scans. > > Color me skeptical. Index searches are seldom purely sequential block > accesses. Maybe if you had a freshly built index that'd never yet > suffered any inserts/updates, but in practice any advantage would > disappear very quickly after a few index page splits. Maybe. I've tested on pgbench test databases, which I'm not sure whether they're freshly built indexes or incrementally built ones, and it applies there (in fact backward index-only scans was one of the workloads the read-ahead patch improved the most).
On Thu, Aug 8, 2013 at 5:01 PM, Kevin Grittner <kgrittn@ymail.com> wrote: > Claudio Freire <klaussfreire@gmail.com> wrote: >> On Wed, Aug 7, 2013 at 4:04 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>>> Yeah, but it's faster if it's in the same direction, because the >>>> kernel read-ahead code detects sequential reads, whereas it doesn't >>>> when it goes backwards. The difference can be up to a factor of 10 for >>>> long index scans. >>> >>> Color me skeptical. Index searches are seldom purely sequential block >>> accesses. Maybe if you had a freshly built index that'd never yet >>> suffered any inserts/updates, but in practice any advantage would >>> disappear very quickly after a few index page splits. >> >> Maybe. >> >> I've tested on pgbench test databases, which I'm not sure whether >> they're freshly built indexes or incrementally built ones, and it >> applies there (in fact backward index-only scans was one of the >> workloads the read-ahead patch improved the most). > > It's been a while, but when I was touching the btree code for the > SSI implementation I thought I saw something about a reverse scan > needing to visit the parent page in cases where a forward scan > doesn't, due to the locking techniques used in btree. I don't know > how significant those extra trips up and down the tree are, but > they must cost *something*. From my benchmarks at the time (with pgbench), they seldom ever happen, so even if they cost a lot, they don't add up to much.