Thread: how to speed up these queries ?


how to speed up these queries ?

From: Dracula 007
Date: Mar 3, 2005, 6:05 AM
Hello,
   I have two "large" tables - "sessions" (about 1.500.000 rows) and
"actions" (about 4.000.000 rows), and the "actions" table is connected
to the "sessions" (it contains a key from it). The simplified structure
of these tables is

sessions (
    session_id int4,
    visitor_id int4,
    session_ip inet,
    session_date timestamp
)

actions (
    action_id int4,
    session_id int4, -- foreign key, references sessions(session_id)
    action_date timestamp,
    action_year int2,
    action_month int2,
    action_day int2
)

I run SQL queries like
   SELECT
      COUNT(actions.session_id) AS sessions_count,
      COUNT(DISTINCT visitor_id) AS visitors_count,
      COUNT(DISTINCT session_ip) AS ips_count
   FROM actions LEFT JOIN sessions USING (session_id)
   GROUP BY action_year, action_month, action_day

but it's really really slow. I've tried to use different indexes on
different columns, but no matter what I've tried I can't get it faster. 
The explain analyze of the query is

--------------------------------------------------
 Aggregate  (cost=347276.05..347276.05 rows=1 width=23) (actual time=210060.349..210060.350 rows=1 loops=1)
   ->  Hash Left Join  (cost=59337.55..305075.27 rows=4220077 width=23) (actual time=24202.338..119924.254 rows=4220077 loops=1)
         Hash Cond: ("outer".session_id = "inner".session_id)
         ->  Seq Scan on actions  (cost=0.00..114714.77 rows=4220077 width=8) (actual time=7539.653..44585.023 rows=4220077 loops=1)
         ->  Hash  (cost=47650.64..47650.64 rows=1484764 width=19) (actual time=16628.790..16628.790 rows=0 loops=1)
               ->  Seq Scan on sessions  (cost=0.00..47650.64 rows=1484764 width=19) (actual time=0.041..13378.667 rows=1484764 loops=1)
 Total runtime: 210061.073 ms
--------------------------------------------------

As you can see it runs for about 3.5 minutes, which is not too fast. Is there 
some way to speed up such queries?


Re: how to speed up these queries ?

From: Sean Davis
On Mar 3, 2005, at 6:05 AM, Dracula 007 wrote:

> Hello,
>
>    I have two "large" tables - "sessions" (about 1.500.000 rows) and
> "actions" (about 4.000.000 rows), and the "actions" table is connected
> to the "sessions" (it contains a key from it). The simplified structure
> of these tables is
>
> sessions (
>     session_id int4,
>     visitor_id int4,
>     session_ip inet,
>     session_date timestamp
> )
>
> actions (
>     action_id int4,
>     session_id int4, -- foreign key, references sessions(session_id)
>     action_date timestamp,
>     action_year int2,
>     action_month int2,
>     action_day int2
> )
>
> I run SQL queries like
>
>    SELECT
>       COUNT(actions.session_id) AS sessions_count,
>       COUNT(DISTINCT visitor_id) AS visitors_count,
>       COUNT(DISTINCT session_ip) AS ips_count
>    FROM actions LEFT JOIN sessions USING (session_id)
>    GROUP BY action_year, action_month, action_day
>
> but it's really really slow. I've tried to use different indexes on
> different columns, but no matter what I've tried I can't get it 
> faster. The explain analyze of the query is
>
> --------------------------------------------------
>  Aggregate  (cost=347276.05..347276.05 rows=1 width=23) (actual 
> time=210060.349..210060.350 rows=1 loops=1)
>    ->  Hash Left Join  (cost=59337.55..305075.27 rows=4220077 
> width=23) (actual time=24202.338..119924.254 rows=4220077 loops=1)
>          Hash Cond: ("outer".session_id = "inner".session_id)
>          ->  Seq Scan on actions  (cost=0.00..114714.77 rows=4220077 
> width=8) (actual time=7539.653..44585.023 rows=4220077 loops=1)
>          ->  Hash  (cost=47650.64..47650.64 rows=1484764 width=19) 
> (actual time=16628.790..16628.790 rows=0 loops=1)
>                ->  Seq Scan on sessions  (cost=0.00..47650.64 
> rows=1484764 width=19) (actual time=0.041..13378.667 rows=1484764 
> loops=1)

It looks like you are going to always do a sequential scan on the 
tables, as you always look at the entire table(s).  How often do you do 
the query as compared to the load on the database?  If you do the query 
often relative to the load, could you keep a table of counts something 
like:

create table summarize_use (
    action_date date,
    sessions_count int,
    visitors_count int,
    ips_count int
)

and then use triggers from the sessions and actions to increment the 
various counts in the summarize_use table based on the action_date and 
session_date date parts?  The summarize_use table would then look like:

action_date    sessions_count    visitors_count    ips_count
3-2-2005       15                12                12

Just a thought, and of course you would pay the price of triggers with 
each insert to the sessions or actions table.
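
For what it's worth, here is a rough sketch of such a trigger (untested; 
the function and trigger names are made up, and it only maintains 
sessions_count, because the DISTINCT visitor/ip counts are much harder 
to keep incrementally; concurrent inserts could also race between the 
UPDATE and the INSERT):

create or replace function bump_daily_counts() returns trigger as $$
begin
    update summarize_use
       set sessions_count = sessions_count + 1
     where action_date = new.action_date::date;
    -- no row for this day yet: start one
    if not found then
        insert into summarize_use
            (action_date, sessions_count, visitors_count, ips_count)
        values (new.action_date::date, 1, 0, 0);
    end if;
    return new;
end;
$$ language plpgsql;

create trigger actions_summary
    after insert on actions
    for each row execute procedure bump_daily_counts();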

If you don't like that and you run this, say, every night at midnight, 
you could set up a cron job that ran this query and selected it into a 
table for direct querying--call this table summarize_use again.  Then, 
to get the full result, you would simply need to do something like:

select * from summarize_use
union
SELECT
   '2005-03-03'::date AS action_date, -- so both sides of the UNION have matching columns
   COUNT(actions.session_id) AS sessions_count,
   COUNT(DISTINCT visitor_id) AS visitors_count,
   COUNT(DISTINCT session_ip) AS ips_count
FROM actions LEFT JOIN sessions USING (session_id)
WHERE action_year=2005 AND action_month=3 AND action_day=3;
 

This would then be fully up-to-date and would use indices on 
action_year, action_month, action_day.  DISCLAIMER--All of this is 
untested....
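
The nightly cron statement itself might look something like this (again 
untested; it assumes the summarize_use table above and that the job 
runs shortly after midnight, so "yesterday" is complete):

-- hypothetical nightly rollup: append yesterday's totals
INSERT INTO summarize_use (action_date, sessions_count, visitors_count, ips_count)
SELECT action_date::date,
       COUNT(actions.session_id),
       COUNT(DISTINCT visitor_id),
       COUNT(DISTINCT session_ip)
FROM actions LEFT JOIN sessions USING (session_id)
WHERE action_date >= current_date - 1
  AND action_date <  current_date
GROUP BY action_date::date;

You could fire that from cron with something like 
psql -f nightly_rollup.sql (file name made up, of course).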


Sean.



Re: how to speed up these queries ?

From: Sean Davis
On Mar 3, 2005, at 8:34 AM, Dracula 007 wrote:

>
>> It looks like you are going to always do a sequential scan on the 
>> tables, as you always look at the entire table(s).  How often do you 
>> do the query as compared to the load on the database?  If you do the 
>> query often relative to the load, could you keep a table of counts 
>> something like:
>> create table summarize_use (
>>     action_date date,
>>     sessions_count int,
>>     visitors_count int,
>>     ips_count int)
>> and then use triggers from the sessions and actions to increment the 
>> various counts in the summarize_use table based on the action_date 
>> and session_date date parts?  The summarize_use table would then look 
>> like:
>
> I can't exactly predict how often these queries will be executed - 
> this is an example of one of many different queries that are executed 
> over these two tables (they hold the main part of our web 
> application's statistics). And every query can run with different 
> parameter values etc.
>
> It definitely will be less than 1% of all queries run on the server, 
> but the problem is that running it consumes most of the CPU, so all 
> the other queries have to wait and time out. And we don't want the 
> visitors on the web to wait ...
>
> As a temporary solution we build all the possible results once a week 
> (every Monday at 00:00) and store them in separate tables, so it's 
> pretty fast to view, but it takes about 3 hours to rebuild all the 
> possible stats (total, by month, by day, by week) for different views 
> (pages, articles, visitors, etc). I still hope I'll be able to speed 
> it up somehow.
>
> The solution using triggers looks nice, I'll try it and it probably 
> will work, but I can't predict how complicated it will be to log all 
> the interesting stats.
>
> t.v.
>

If your GROUP BY is at its finest grain daily, then the most often you 
will need to run these queries is daily, correct?  Could you try some of your 
queries doing:

select BLAH,BLAH,BLAH from BLAH where action_year=2005 and action_day=3 
and action_month=3

after building indices on the columns?  If this is fast (which it very 
well may be), then you can simply run that set of queries daily and 
insert the result into your "big stats" table, as presumably all other 
data in the "big stats" table is static if the date has past.  That 
would save you writing the triggers, which could be complicated from 
what you are saying....
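
A sketch of what I mean (untested; the index name is made up, and one 
multicolumn index should serve the year/month/day filter better than 
three separate single-column ones):

CREATE INDEX actions_ymd_idx ON actions (action_year, action_month, action_day);

-- the rollup for a single day, ready to insert into the "big stats" table
SELECT COUNT(actions.session_id) AS sessions_count,
       COUNT(DISTINCT visitor_id) AS visitors_count,
       COUNT(DISTINCT session_ip) AS ips_count
FROM actions LEFT JOIN sessions USING (session_id)
WHERE action_year = 2005 AND action_month = 3 AND action_day = 3;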

Sean