Thread: speed up query with max() and odd estimates

speed up query with max() and odd estimates

From: Matthew Nuzum
Date: Tue, 26 Apr 2005 15:16:57 -0500
I have this query that takes a little over 8 min to run:
select client,max(atime) as atime from usage_access where atime >=
(select atime - '1 hour'::interval from usage_access order by atime
desc limit 1) group by client;

I think it can go a lot faster. Any suggestions on improving this? DB
is 7.3.4 I think. (There is no index on client because it is very big
and this data is used infrequently.)

explain ANALYZE select client,max(atime) as atime from usage_access
where atime >= (select atime - '1 hour'::interval from usage_access
order by atime desc limit 1) group by client;

      QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=3525096.28..3620450.16 rows=1271385 width=20) (actual time=482676.95..482693.69 rows=126 loops=1)
   InitPlan
     ->  Limit  (cost=0.00..0.59 rows=1 width=8) (actual time=0.40..0.41 rows=1 loops=1)
           ->  Index Scan Backward using usage_access_atime on usage_access  (cost=0.00..22657796.18 rows=38141552 width=8) (actual time=0.39..0.40 rows=2 loops=1)
   ->  Group  (cost=3525096.28..3588665.53 rows=12713851 width=20) (actual time=482676.81..482689.29 rows=3343 loops=1)
         ->  Sort  (cost=3525096.28..3556880.90 rows=12713851 width=20) (actual time=482676.79..482679.16 rows=3343 loops=1)
               Sort Key: client
               ->  Seq Scan on usage_access  (cost=0.00..1183396.40 rows=12713851 width=20) (actual time=482641.57..482659.18 rows=3343 loops=1)
                     Filter: (atime >= $0)
 Total runtime: 482694.65 msec


I'm starting to understand this, which is quite frightening to me. I
thought that maybe if I shrank the number of rows down I could improve
things a bit, but my first attempt didn't work. I thought I'd replace
the "from usage_access" with this query instead:
select * from usage_access where atime >= (select atime - '1
hour'::interval from usage_access order by atime desc limit 1);

      QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on usage_access  (cost=0.00..1183396.40 rows=12713851 width=116) (actual time=481796.22..481839.43 rows=3343 loops=1)
   Filter: (atime >= $0)
   InitPlan
     ->  Limit  (cost=0.00..0.59 rows=1 width=8) (actual time=0.41..0.42 rows=1 loops=1)
           ->  Index Scan Backward using usage_access_atime on usage_access  (cost=0.00..22657796.18 rows=38141552 width=8) (actual time=0.40..0.41 rows=2 loops=1)
 Total runtime: 481842.47 msec

It doesn't look like this will help at all.

This table is primarily append-only; however, I recently deleted a few
million rows from it, if that helps anyone.

--
Matthew Nuzum
www.bearfruit.org

Re: speed up query with max() and odd estimates

From: "Steinar H. Gunderson"
Date:
On Tue, Apr 26, 2005 at 03:16:57PM -0500, Matthew Nuzum wrote:
>  Seq Scan on usage_access  (cost=0.00..1183396.40 rows=12713851
> width=116) (actual time=481796.22..481839.43 rows=3343 loops=1)

That's a gross misestimation -- four orders of magnitude off!

Have you considered doing this in two steps: first getting the value out
of the subquery, then running the outer query with it? Have you ANALYZEd recently?
Do you have an index on atime?
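
For example, the two-step version could look roughly like this (just a sketch;
the cutoff returned by the first query gets substituted into the second):

SELECT atime - '1 hour'::interval AS cutoff
  FROM usage_access ORDER BY atime DESC LIMIT 1;

SELECT client, max(atime) AS atime
  FROM usage_access
 WHERE atime >= '<cutoff from the first query>'
 GROUP BY client;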

/* Steinar */
--
Homepage: http://www.sesse.net/

Re: speed up query with max() and odd estimates

From: Matthew Nuzum
Date:
On 4/26/05, Steinar H. Gunderson <sgunderson@bigfoot.com> wrote:
> On Tue, Apr 26, 2005 at 03:16:57PM -0500, Matthew Nuzum wrote:
> >  Seq Scan on usage_access  (cost=0.00..1183396.40 rows=12713851
> > width=116) (actual time=481796.22..481839.43 rows=3343 loops=1)
>
> That's a gross misestimation -- four orders of magnitude off!
>
> Have you considered doing this in two steps: first getting the value out
> of the subquery, then running the outer query with it? Have you ANALYZEd recently?
> Do you have an index on atime?
>

Yes, there is an index on atime. I'll re-analyze but I'm pretty
certain that runs nightly.

Regarding two steps, are you suggesting:
begin;
select * into temp_table...;
select * from temp_table...;
drop table temp_table;
rollback;

I have not tried that but will.

BTW, I created an index on client just for the heck of it and there
was no improvement (actually, a slight degradation).

--
Matthew Nuzum
www.bearfruit.org

Re: speed up query with max() and odd estimates

From: John A Meinel
Date:
Matthew Nuzum wrote:
> I have this query that takes a little over 8 min to run:
> select client,max(atime) as atime from usage_access where atime >=
> (select atime - '1 hour'::interval from usage_access order by atime
> desc limit 1) group by client;
>
> I think it can go a lot faster. Any suggestions on improving this? DB
> is 7.3.4 I think. (There is no index on client because it is very big
> and this data is used infrequently.)
Switch to Postgres 8.0.2 :)

Actually, I think one problem you are running into is that Postgres (at
least historically) has had trouble estimating the selectivity of
date/timestamp fields when the comparison value is not a constant.

So it isn't switching over to an index scan, even though you are
restricting the access time.

I would guess that creating a multi-column index on (client, atime)
*might* get you the best performance.
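
Something along these lines (the index name here is just illustrative):

CREATE INDEX usage_access_client_atime_idx ON usage_access (client, atime);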

Try adding the index, and then doing this query:

select atime from usage_access where client = <client_id>
    order by atime desc limit 1;

If you can get that query to use an index, then you can put it in a
loop. Something like:

CREATE FUNCTION last_client_access() RETURNS SETOF timestamp AS '
DECLARE
    client_id INT;
    client_time TIMESTAMP;
BEGIN
    -- <client_list> is a placeholder for whatever table holds the client ids.
    FOR client_id IN SELECT id FROM <client_list> LOOP
        -- Most recent access time for this client; with a (client, atime)
        -- index this is a quick backward index scan.
        SELECT INTO client_time atime FROM usage_access
            WHERE client = client_id
            ORDER BY atime DESC LIMIT 1;
        RETURN NEXT client_time;
    END LOOP;
    RETURN;
END;
' LANGUAGE plpgsql;
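
A hypothetical call (assuming the function above compiles against your schema)
would then return one row per client id in <client_list>:

SELECT * FROM last_client_access();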

If you really need high speed, you could create a partial index for each
client id, something like:
CREATE INDEX usage_access_atime_client1_idx ON usage_access(atime)
    WHERE client = client1;

But that is a lot of indexes to maintain.

I'm hoping that the multi-column index would be enough.

You might also try something like:

SELECT client, max(atime) FROM usage_access
  WHERE atime > now() - '1 hour'::interval
  GROUP BY client;

now() is closer to a constant, so Postgres might have a better time figuring
out the selectivity. I don't know your table, but I assume you are
constantly inserting new rows, so the largest atime value will be close
to now(). Remember that in this query (and in your original query) clients
whose last access is more than 1 hour before the maximum atime (across all
clients) will not be shown. (Example: client 1 accessed yesterday and
client 2 accessed right now; the cutoff would be based on today's access,
which would hide client 1.)

Also, if it is simply a problem of the planner mis-estimating the
selectivity of the atime predicate, you can raise the statistics target for atime:

ALTER TABLE usage_access ALTER COLUMN atime SET STATISTICS 1000;
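
Note that the new statistics target only takes effect once the table is
re-analyzed, so follow it with:

ANALYZE usage_access;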

I'm not really sure what else to try, but you might start there.

Also, I still recommend upgrading to postgres 8, as I think it handles a
lot of these things better. (7.3 is pretty old).

John
=:->

>
> explain ANALYZE select client,max(atime) as atime from usage_access
> where atime >= (select atime - '1 hour'::interval from usage_access
> order by atime desc limit 1) group by client;
>
>       QUERY PLAN
>
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
>  Aggregate  (cost=3525096.28..3620450.16 rows=1271385 width=20)
> (actual time=482676.95..482693.69 rows=126 loops=1)
>    InitPlan
>      ->  Limit  (cost=0.00..0.59 rows=1 width=8) (actual
> time=0.40..0.41 rows=1 loops=1)
>            ->  Index Scan Backward using usage_access_atime on
> usage_access  (cost=0.00..22657796.18 rows=38141552 width=8) (actual
> time=0.39..0.40 rows=2 loops=1)
>    ->  Group  (cost=3525096.28..3588665.53 rows=12713851 width=20)
> (actual time=482676.81..482689.29 rows=3343 loops=1)
>          ->  Sort  (cost=3525096.28..3556880.90 rows=12713851
> width=20) (actual time=482676.79..482679.16 rows=3343 loops=1)
>                Sort Key: client
>                ->  Seq Scan on usage_access  (cost=0.00..1183396.40
> rows=12713851 width=20) (actual time=482641.57..482659.18 rows=3343
> loops=1)
>                      Filter: (atime >= $0)
>  Total runtime: 482694.65 msec
>
>
> I'm starting to understand this, which is quite frightening to me. I
> thought that maybe if I shrink the number of rows down I could improve
> things a bit, but my first attempt didn't work. I thought I'd replace
> the "from usage_access" with this query instead:
> select * from usage_access where atime >= (select atime - '1
> hour'::interval from usage_access order by atime desc limit 1);
>
>       QUERY PLAN
>
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
>  Seq Scan on usage_access  (cost=0.00..1183396.40 rows=12713851
> width=116) (actual time=481796.22..481839.43 rows=3343 loops=1)
>    Filter: (atime >= $0)
>    InitPlan
>      ->  Limit  (cost=0.00..0.59 rows=1 width=8) (actual
> time=0.41..0.42 rows=1 loops=1)
>            ->  Index Scan Backward using usage_access_atime on
> usage_access  (cost=0.00..22657796.18 rows=38141552 width=8) (actual
> time=0.40..0.41 rows=2 loops=1)
>  Total runtime: 481842.47 msec
>
> It doesn't look like this will help at all.
>
> This table is primarily append-only; however, I recently deleted a few
> million rows from it, if that helps anyone.
>



Re: speed up query with max() and odd estimates

From: Matthew Nuzum
Date:
On 4/26/05, Steinar H. Gunderson <sgunderson@bigfoot.com> wrote:
> On Tue, Apr 26, 2005 at 03:16:57PM -0500, Matthew Nuzum wrote:
> >  Seq Scan on usage_access  (cost=0.00..1183396.40 rows=12713851
> > width=116) (actual time=481796.22..481839.43 rows=3343 loops=1)
>
> That's a gross misestimation -- four orders of magnitude off!
>
> Have you considered doing this in two steps: first getting the value out
> of the subquery, then running the outer query with it?

Well, I don't know if the estimates are correct now or not, but I
found that your suggestion of doing it in two steps helped a lot.

For the archives, here's what made a drastic improvement:

This batch program had an overhead of 25 min to build hash tables
using the sql queries. It is now down to about 47 seconds.

The biggest improvement (bringing it down to 9 min) was to get rid
of all instances of `select max(field) from ...` and replace them
with `select field from ... order by field desc limit 1` (on this
version of Postgres, max() cannot use an index, while an ORDER BY
... DESC LIMIT 1 can walk the index backwards).

Then, to get it down to the final 47 seconds, I changed this query:
SELECT client,max(atime) as atime from usage_access where atime >=
(select atime - '1 hour'::interval from usage_access order by atime
desc limit 1) group by client;

To these three queries:
SELECT atime - '1 hour'::interval from usage_access order by atime desc limit 1;
SELECT client, atime into temporary table recent_sessions from
usage_access where atime >= '%s';
SELECT client, max(atime) as atime from recent_sessions group by client;

Thanks for the help.
--
Matthew Nuzum
www.bearfruit.org