Обсуждение: Efficiently query for the most recent record for a given user

Поиск
Список
Период
Сортировка

Efficiently query for the most recent record for a given user

От
Robert DiFalco
Дата:
Let's say I have a table something like this:

   create table call_activity (
        id int8 not null,
        called timestamp,
        user_id int8 not null,
        primary key (id)
        foreign key (user_id) references my_users
    )


I want to get the last call_activity record for a single user.  Is there ANY way to efficiently retrieve the last record for a specified user_id, or do I need to de-normalize and update a table with a single row for each user each time a new call_activity record is inserted? I know I how to do the query without the summary table (subquery or GROUP BY with MAX) but that seems like it will never perform well for large data sets. Or am I full of beans and it should perform just fine for a huge data set as long as I have an index on "called"? 

Thanks in advance!

Re: Efficiently query for the most recent record for a given user

От
Claudio Freire
Дата:
On Wed, Aug 7, 2013 at 3:12 PM, Robert DiFalco <robert.difalco@gmail.com> wrote:
> Let's say I have a table something like this:
>
>    create table call_activity (
>         id int8 not null,
>         called timestamp,
>         user_id int8 not null,
>         primary key (id)
>         foreign key (user_id) references my_users
>     )
>
>
> I want to get the last call_activity record for a single user.  Is there ANY
> way to efficiently retrieve the last record for a specified user_id, or do I
> need to de-normalize and update a table with a single row for each user each
> time a new call_activity record is inserted? I know I how to do the query
> without the summary table (subquery or GROUP BY with MAX) but that seems
> like it will never perform well for large data sets. Or am I full of beans
> and it should perform just fine for a huge data set as long as I have an
> index on "called"?


Create an index over (user_id, called desc), and do

select * from call_activity where user_id = blarg order by called desc limit 1


Re: Efficiently query for the most recent record for a given user

От
Tom Lane
Дата:
Claudio Freire <klaussfreire@gmail.com> writes:
> On Wed, Aug 7, 2013 at 3:12 PM, Robert DiFalco <robert.difalco@gmail.com> wrote:
>> I want to get the last call_activity record for a single user.

> Create an index over (user_id, called desc), and do
> select * from call_activity where user_id = blarg order by called desc limit 1

Note that there's no particular need to specify "desc" in the index
definition.  This same index can support searches in either direction
on the "called" column.

            regards, tom lane


Re: Efficiently query for the most recent record for a given user

От
Igor Neyman
Дата:
> -----Original Message-----
> From: pgsql-performance-owner@postgresql.org [mailto:pgsql-
> performance-owner@postgresql.org] On Behalf Of Claudio Freire
> Sent: Wednesday, August 07, 2013 2:20 PM
> To: Robert DiFalco
> Cc: pgsql-performance@postgresql.org
> Subject: Re: [PERFORM] Efficiently query for the most recent record for a
> given user
>
> On Wed, Aug 7, 2013 at 3:12 PM, Robert DiFalco <robert.difalco@gmail.com>
> wrote:
> > Let's say I have a table something like this:
> >
> >    create table call_activity (
> >         id int8 not null,
> >         called timestamp,
> >         user_id int8 not null,
> >         primary key (id)
> >         foreign key (user_id) references my_users
> >     )
> >
> >
> > I want to get the last call_activity record for a single user.  Is
> > there ANY way to efficiently retrieve the last record for a specified
> > user_id, or do I need to de-normalize and update a table with a single
> > row for each user each time a new call_activity record is inserted? I
> > know I how to do the query without the summary table (subquery or
> > GROUP BY with MAX) but that seems like it will never perform well for
> > large data sets. Or am I full of beans and it should perform just fine
> > for a huge data set as long as I have an index on "called"?
>
>
> Create an index over (user_id, called desc), and do
>
> select * from call_activity where user_id = blarg order by called desc limit 1
>

And most recent call for every user:

SELECT id, user_id, MAX(called) OVER (PARTITION BY user_id) FROM call_activity;

Regards,
Igor Neyman



Re: Efficiently query for the most recent record for a given user

От
Robert DiFalco
Дата:
Thanks guys!


On Wed, Aug 7, 2013 at 11:35 AM, Igor Neyman <ineyman@perceptron.com> wrote:
> -----Original Message-----
> From: pgsql-performance-owner@postgresql.org [mailto:pgsql-
> performance-owner@postgresql.org] On Behalf Of Claudio Freire
> Sent: Wednesday, August 07, 2013 2:20 PM
> To: Robert DiFalco
> Cc: pgsql-performance@postgresql.org
> Subject: Re: [PERFORM] Efficiently query for the most recent record for a
> given user
>
> On Wed, Aug 7, 2013 at 3:12 PM, Robert DiFalco <robert.difalco@gmail.com>
> wrote:
> > Let's say I have a table something like this:
> >
> >    create table call_activity (
> >         id int8 not null,
> >         called timestamp,
> >         user_id int8 not null,
> >         primary key (id)
> >         foreign key (user_id) references my_users
> >     )
> >
> >
> > I want to get the last call_activity record for a single user.  Is
> > there ANY way to efficiently retrieve the last record for a specified
> > user_id, or do I need to de-normalize and update a table with a single
> > row for each user each time a new call_activity record is inserted? I
> > know I how to do the query without the summary table (subquery or
> > GROUP BY with MAX) but that seems like it will never perform well for
> > large data sets. Or am I full of beans and it should perform just fine
> > for a huge data set as long as I have an index on "called"?
>
>
> Create an index over (user_id, called desc), and do
>
> select * from call_activity where user_id = blarg order by called desc limit 1
>

And most recent call for every user:

SELECT id, user_id, MAX(called) OVER (PARTITION BY user_id) FROM call_activity;

Regards,
Igor Neyman


Re: Efficiently query for the most recent record for a given user

От
Claudio Freire
Дата:
On Wed, Aug 7, 2013 at 3:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Claudio Freire <klaussfreire@gmail.com> writes:
>> On Wed, Aug 7, 2013 at 3:12 PM, Robert DiFalco <robert.difalco@gmail.com> wrote:
>>> I want to get the last call_activity record for a single user.
>
>> Create an index over (user_id, called desc), and do
>> select * from call_activity where user_id = blarg order by called desc limit 1
>
> Note that there's no particular need to specify "desc" in the index
> definition.  This same index can support searches in either direction
> on the "called" column.


Yeah, but it's faster if it's in the same direction, because the
kernel read-ahead code detects sequential reads, whereas it doesn't
when it goes backwards. The difference can be up to a factor of 10 for
long index scans.

Though... true... for a limit 1... it wouldn't matter that much. But
it's become habit to match index sort order by now.


Re: Efficiently query for the most recent record for a given user

От
Tom Lane
Дата:
Claudio Freire <klaussfreire@gmail.com> writes:
> On Wed, Aug 7, 2013 at 3:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Note that there's no particular need to specify "desc" in the index
>> definition.  This same index can support searches in either direction
>> on the "called" column.

> Yeah, but it's faster if it's in the same direction, because the
> kernel read-ahead code detects sequential reads, whereas it doesn't
> when it goes backwards. The difference can be up to a factor of 10 for
> long index scans.

Color me skeptical.  Index searches are seldom purely sequential block
accesses.  Maybe if you had a freshly built index that'd never yet
suffered any inserts/updates, but in practice any advantage would
disappear very quickly after a few index page splits.

> Though... true... for a limit 1... it wouldn't matter that much.

That's the other point.

            regards, tom lane


Re: Efficiently query for the most recent record for a given user

От
Alvaro Herrera
Дата:
Claudio Freire escribió:
> On Wed, Aug 7, 2013 at 3:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> > Note that there's no particular need to specify "desc" in the index
> > definition.  This same index can support searches in either direction
> > on the "called" column.
>
> Yeah, but it's faster if it's in the same direction, because the
> kernel read-ahead code detects sequential reads, whereas it doesn't
> when it goes backwards. The difference can be up to a factor of 10 for
> long index scans.

That might be true when an index is new, but as it grows, the leaf pages
are not going to be sequential anymore.  And this doesn't much apply for
an equality lookup anyway, does it?

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: Efficiently query for the most recent record for a given user

От
Claudio Freire
Дата:
On Wed, Aug 7, 2013 at 4:04 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Yeah, but it's faster if it's in the same direction, because the
>> kernel read-ahead code detects sequential reads, whereas it doesn't
>> when it goes backwards. The difference can be up to a factor of 10 for
>> long index scans.
>
> Color me skeptical.  Index searches are seldom purely sequential block
> accesses.  Maybe if you had a freshly built index that'd never yet
> suffered any inserts/updates, but in practice any advantage would
> disappear very quickly after a few index page splits.

Maybe.

I've tested on pgbench test databases, which I'm not sure whether
they're freshly built indexes or incrementally built ones, and it
applies there (in fact backward index-only scans was one of the
workloads the read-ahead patch improved the most).


Re: Efficiently query for the most recent record for a given user

От
Claudio Freire
Дата:
On Thu, Aug 8, 2013 at 5:01 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
> Claudio Freire <klaussfreire@gmail.com> wrote:
>> On Wed, Aug 7, 2013 at 4:04 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>>> Yeah, but it's faster if it's in the same direction, because the
>>>> kernel read-ahead code detects sequential reads, whereas it doesn't
>>>> when it goes backwards. The difference can be up to a factor of 10 for
>>>> long index scans.
>>>
>>> Color me skeptical.  Index searches are seldom purely sequential block
>>> accesses.  Maybe if you had a freshly built index that'd never yet
>>> suffered any inserts/updates, but in practice any advantage would
>>> disappear very quickly after a few index page splits.
>>
>> Maybe.
>>
>> I've tested on pgbench test databases, which I'm not sure whether
>> they're freshly built indexes or incrementally built ones, and it
>> applies there (in fact backward index-only scans was one of the
>> workloads the read-ahead patch improved the most).
>
> It's been a while, but when I was touching the btree code for the
> SSI implementation I thought I saw something about a reverse scan
> needing to visit the parent page in cases where a forward scan
> doesn't, due to the locking techniques used in btree.  I don't know
> how significant those extra trips up and down the tree are, but
> they must cost *something*.

From my benchmarks at the time (with pgbench), they seldom ever
happen, so even if they cost a lot, they don't add up to much.